You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@arrow.apache.org by Fan Liya <li...@gmail.com> on 2020/03/10 13:12:23 UTC

[DISCUSS][Java] Support non-nullable vectors

Dear all,

A non-nullable vector is one that is guaranteed to contain no nulls. We
want to support non-nullable vectors in Java.

*Motivations:*
1. It is widely used in practice. For example, in a database engine, a
column can be declared as not null, so it cannot contain null values.
2.Non-nullable vectors has significant performance advantages compared with
their nullable conterparts, such as:
  1) the memory space of the validity buffer can be saved.
  2) manipulation of the validity buffer can be bypassed
  3) some if-else branches can be replaced by sequential instructions (by
the JIT compiler), leading to high throughput for the CPU pipeline.

*Potential Cost:*
For nullable vectors, there can be extra checks against the nullablility
flag. So we must change the code in a way that minimizes the cost.

*Proposed Changes:*
1. There is no need to create new vector classes. We add a final boolean to
the vector base classes as the nullability flag. The value of the flag can
be obtained from the field when creating the vector.
2. Add a method "boolean isNullable()" to the root interface ValueVector.
3. If a vector is non-nullable, its validity buffer should be an empty
buffer (not null, so much of the existing logic can be left unchanged).
4. For operations involving validity buffers (e.g. isNull, get, set), we
use the nullability flag to bypass manipulations to the validity buffer.

Therefore, it should be possible to support the feature with small code
changes.

BTW, please note that similar behaviors have already been supported in C++.

Would you please give your valueable feedback?

Best,
Liya Fan

Re: [DISCUSS][Java] Support non-nullable vectors

Posted by Fan Liya <li...@gmail.com>.
Hi Wes,

Thanks for your good suggestion, we will think about it.

Hi Jacques,

Thanks for your feedback. Please allow me to address some of your concerns:

> This change is undesirable as it optimizes one path and makes several
others behave in unintended ways.

Could you please elaborate more on this?

> What happens if a vector with nulls shows up? What happens if a user sets
a position to a null value in user code when this flag set?

Please note that in addition to the flag, we also have
BaseValueVector#nullable to tell if a vector is nullable.
So if  BaseValueVector#nullable = true, it is OK to set a null in any
position. However, if BaseValueVector#nullable = false and the user tries
to set null in some position, there must be some bugs in their code.

> If the answer to the above questions is the use is an advanced user, then
> why can't they just call:
> PlatformDependent.getInt(vector.memoryAddress() + position * 4).

> Why would we introduce something directly in the vector class for this
> specialized use case? If the user is advanced, that short memory access
> invocation seems fine to use.

IMO, the problems of directly invoking the PlatformDependent APIs include:
1. The validity buffer for the vector is still allocated anyway, so the
benefit of reduced memory consumption is lost.
2. The user can only use this trick in their client code. If they need to
use our libraries (e.g. flight, adapter, algorithm), they still have to pay
the performance price.
3. If we simply skip the validity buffer and manipulate the data buffer
directly, the vector will easily end up in an inconsistent state, causing
unexpected problems.

> The whole idea with Arrow is that if you have a specialized algorithm,
you can hand write memory reads and writes because you have predictable
memory layout.

This sounds reasonable.
IMO, supporting non-nullable vectors is not a specialized requirement.
Instead, it is a general requirement, and has a wide range of applications,
especially in relational data processing.
So it deserves some special treatment (I think), given the benefits
discussed above. Please think about it.

Best,
Liya Fan


On Thu, Sep 10, 2020 at 11:32 PM Wes McKinney <we...@gmail.com> wrote:

> I agree with Jacques here. Perhaps what is needed is an unsafe
> non-nullable array accessor layer, then there is no need for flags
> etc. We've already been writing a lot of such code in C++ (splitting
> between no-nulls and some-nulls paths, see also the BitBlockCounter
> stuff we've been doing, is such a thing available in Java yet?) and it
> hasn't seemed especially burdensome to me.
>
> On Thu, Sep 10, 2020 at 9:37 AM Jacques Nadeau <ja...@apache.org> wrote:
> >
> > This change is undesirable as it optimizes one path and makes several
> > others behave in unintended ways.  What happens if a vector with nulls
> > shows up? What happens if a user sets a position to a null value in user
> > code when this flag set?
> >
> > If the answer to the above questions is the use is an advanced user, then
> > why can't they just call:
> > PlatformDependent.getInt(vector.memoryAddress() + position * 4).
> >
> > Why would we introduce something directly in the vector class for this
> > specialized use case? If the user is advanced, that short memory access
> > invocation seems fine to use. The whole idea with Arrow is that if you
> have
> > a specialized algorithm, you can hand write memory reads and writes
> because
> > you have predictable memory layout.
> >
> > On Wed, Sep 9, 2020 at 1:11 AM Fan Liya <li...@gmail.com> wrote:
> >
> > > Hi all,
> > >
> > > Thanks a lot for your previous feedback.
> > >
> > > Now we have made some investigation and prepared an initial PR
> supporting
> > > the non-nullable IntVector [1], as this represents a common scenario.
> > > Some initial observations and conclusions can be made.
> > >
> > > The basic idea of the PR is to provide a global static final flag
> > > (NON_NULLABLE_VECTORS_ENABLED) to enable/disable the new feature.
> > > 1. When the flag is enabled, manipulations to the validity buffer
> would be
> > > by-passed if the vector is non-nullable.
> > > 2. When the flag is disabled, the behavior is identical to the original
> > > code.
> > >
> > > *First*, we want to show that the change is small. To support the
> > > non-nullable IntVector, we need to change classes IntVector,
> > > BaseFixedWidthVector (the superclass of IntVector), and BaseValueVector
> > > (the superclass of BaseFixedWidthVector). The amount of changes (lines
> of
> > > code) to each class is given below
> > >
> > > Class
> > >
> > > Additions (#lines)
> > >
> > > Deletions (#lines)
> > >
> > > Total # lines
> > >
> > > IntVector
> > >
> > > 12
> > >
> > > 5
> > >
> > > 370
> > >
> > > BaseFixedWidthVector
> > >
> > > 30
> > >
> > > 11
> > >
> > > 925
> > >
> > > BaseValueVector
> > >
> > > 7
> > >
> > > 0
> > >
> > > 240
> > >
> > >  It can be seen that the change is small, relative to the class size.
> In
> > > addition, to support additional vector types, we only need to change
> the
> > > sub-classes, and no more need to change the super classes.
> > >
> > > *Second*, we want to show that the performance improvement is notable.
> To
> > > see this, we give the performance data of the IntBenchmark (with some
> > > benchmarks added in the PR). To make the performance as good as
> possible,
> > > we enable the ARROW_ENABLE_UNSAFE_MEMORY_ACCESS flag and disable
> > > the ARROW_ENABLE_NULL_CHECK_FOR_GET flag. Below, we give the data with
> the
> > > non-nullable vector flag turned off and on, respectively.
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > > *(Vector non-nullable flag off)Benchmark                         Mode
> Cnt
> > >      Score     Error  UnitsIntBenchmarks.getInt              avgt    5
> > >  384.948 ±   3.336  ns/opIntBenchmarks.isNull              avgt    5
> > > 1301.005 ±  54.239  ns/opIntBenchmarks.setIntDirectly      avgt    5
> > >  15387.486 ± 555.749  ns/opIntBenchmarks.setWithValueHolder  avgt    5
> > >  15251.351 ± 134.286  ns/opIntBenchmarks.setWithWriter       avgt    5
> > >  28595.586 ± 932.528  ns/op*
> > >
> > > *(Vector non-nullable flag on)*
> > >
> > >
> > >
> > >
> > > *IntBenchmarks.getInt              avgt    5    384.630 ±    1.547
> > >  ns/opIntBenchmarks.isNull              avgt    5      3.004 ±    0.110
> > >  ns/opIntBenchmarks.setIntDirectly      avgt    5  13511.605 ±  135.455
> > >  ns/opIntBenchmarks.setWithValueHolder  avgt    5  13035.883 ±  196.081
> > >  ns/opIntBenchmarks.setWithWriter       avgt    5  24734.825 ± 1603.708
> > >  ns/op*
> > >
> > > For the *getInt *operation, there is little performance difference.
> This
> > > is because we disable the  ARROW_ENABLE_NULL_CHECK_FOR_GET flag, so
> > > manipulations to the validity buffer are by-passed, even if the
> > > non-nullable vector flag is off. When the
> ARROW_ENABLE_NULL_CHECK_FOR_GET
> > > is enabled, there is a 72.4% performance improvement gained by turning
> on
> > > the non-nullable vector flag.
> > >
> > > For the isNull operation, we see 3 orders of magnitude performance
> > > improvements by enabling the non-nullable vector flag.
> > >
> > > For other operations, we see 12.2%, 13.5% and 14.5% performance
> > > improvements by turning on the non-nullable vector flag.
> > >
> > > So it can be seen that notable performance improvements can be gained
> for
> > > non-nullable vectors.
> > >
> > > *Third*, we want to claim that for nullable vectors and scenarios when
> we
> > > turn off the non-nullable vector flag, the new changes do not introduce
> > > performance regression. Such concern is plausible, as our changes add
> some
> > > if-else branches to the code, which may degrade performance.
> > >
> > > We give the benchmark results of the original code, as below.
> > >
> > > *(original ocde)*
> > >
> > >
> > >
> > >
> > >
> > > *Benchmark                         Mode  Cnt      Score     Error
> > >  UnitsIntBenchmarks.getInt              avgt    5    383.511 ±   0.156
> > >  ns/opIntBenchmarks.isNull              avgt    5   1274.271 ±  19.092
> > >  ns/opIntBenchmarks.setIntDirectly      avgt    5  15162.219 ± 194.956
> > >  ns/opIntBenchmarks.setWithValueHolder  avgt    5  15247.640 ± 153.103
> > >  ns/opIntBenchmarks.setWithWriter       avgt    5  28587.780 ± 160.458
> > >  ns/op*
> > >
> > > By comparing this set of results with the above results with
> non-nullable
> > > vector flag disabled, little performance difference can be observed,
> > > indicating no performance regression.
> > >
> > > By examining the generated assembly, it can be seen that JIT is smart
> > > enough to remove the if-else branch completely (the below screen-shot
> gives
> > > an example where the if branch is optimized away in the assembly)
> > >
> > > So if some users do not like this feature, because their vectors are
> > > always nullable, they can simply disable the flag, and no performance
> > > difference can be observed.
> > >
> > > Would you please give some feedback?
> > > If it looks good to you, maybe we can go ahead to support other types
> of
> > > vectors.
> > > Thank you in advance.
> > >
> > > Liya Fan
> > >
> > > [image: image.png]
> > >
> > > [1] https://github.com/apache/arrow/pull/8147
> > >
> > > On Fri, Mar 13, 2020 at 9:47 PM Fan Liya <li...@gmail.com> wrote:
> > >
> > >> Hi Jacques,
> > >>
> > >> Thanks a lot for your valuable comments.
> > >>
> > >> I agree with you that collapsing nullable and non-nullable
> > >> implementations is a good idea, and it does not contradict with the
> idea of
> > >> introducing a fast code path, if it does not introduce much cost or
> code
> > >> complexity.
> > >>
> > >> The idea of word level checking is interesting.
> > >>
> > >> As you suggested, we will do more investigations and reconsider the
> > >> problem from a broader perspective.
> > >>
> > >> Best,
> > >> Liya Fan
> > >>
> > >>
> > >>
> > >> On Thu, Mar 12, 2020 at 9:27 AM Jacques Nadeau <ja...@apache.org>
> > >> wrote:
> > >>
> > >>> Generally Ive found that this isnt an important optimization in the
> use
> > >>> cases we see. Memory overhead, especially with our Java shared
> allocation
> > >>> scheme is nominal. Optimizing null checks at the word level usually
> is
> > >>> much
> > >>> more impactful since non null and null runs are much more common on a
> > >>> shorter window common than they seeing those declared.
> > >>>
> > >>> In other words, I'd suggest you look at your problem with a broader
> > >>> perspective and see whether you're actually focused on optimizing the
> > >>> most
> > >>> important dimension.
> > >>>
> > >>> As an aside, the original Arrow Java code actually treated these
> concepts
> > >>> more distinctly and we consciously made a decision to collapse them
> to
> > >>> simplify real world use.
> > >>>
> > >>> I do think it makes to add a dirty read interface if you want. This
> would
> > >>> allow consumers of the interface to behave efficiently if they
> wanted to.
> > >>>
> > >>> One last note, efficient evaluation and processing should generally
> > >>> always
> > >>> work at the validity word level. Adding an extra if check at the word
> > >>> versus extra complexity of having an early out per batch seems like a
> > >>> pretty small life in the grand scheme of processing.
> > >>>
> > >>> On Wed, Mar 11, 2020, 9:15 AM Brian Hulette <hu...@gmail.com>
> wrote:
> > >>>
> > >>> > > And there is a "nullable" metadata-only flag at the
> > >>> > > Field level. Could the same kinds of optimizations be
> implemented in
> > >>> > > Java without introducing a "nullable" concept?
> > >>> >
> > >>> > Note Liya Fan did suggest pulling the nullable flag from the Field
> > >>> when the
> > >>> > vector is created in item (1) of the proposed changes.
> > >>> >
> > >>> > Brian
> > >>> >
> > >>> > On Wed, Mar 11, 2020 at 5:54 AM Fan Liya <li...@gmail.com>
> wrote:
> > >>> >
> > >>> > > Hi Micah,
> > >>> > >
> > >>> > > Thanks a lot for your valuable comments. Please see my comments
> > >>> inline.
> > >>> > >
> > >>> > > > I'm a little concerned that this will change assumptions for at
> > >>> least
> > >>> > > some
> > >>> > > > of the clients using the library (some might always rely on the
> > >>> > validity
> > >>> > > > buffer being present).
> > >>> > >
> > >>> > > I can understand your concern and I am also concerned.
> > >>> > > IMO, the client should not depend on this assumption, as the
> > >>> > specification
> > >>> > > says "Arrays having a 0 null count may choose to not allocate the
> > >>> > validity
> > >>> > > bitmap." [1]
> > >>> > > That being said, I think it would be safe to provide a global
> flag to
> > >>> > > switch on/off the feature (as you suggested).
> > >>> > >
> > >>> > > > I think this is a good feature to have for the reasons you
> > >>> mentioned.
> > >>> > It
> > >>> > > > seems like there would need to be some sort of configuration
> bit
> > >>> to set
> > >>> > > for
> > >>> > > > this behavior.
> > >>> > >
> > >>> > > Good suggestion. We should be able to switch on and off the
> feature
> > >>> with
> > >>> > a
> > >>> > > single global flag.
> > >>> > >
> > >>> > > > But, I'd be worried about code complexity this would
> > >>> > > > introduce.
> > >>> > >
> > >>> > > I agree with you that code complexity is an important factor to
> > >>> consider.
> > >>> > > IMO, our proposal should not involve too much code change, or
> > >>> increase
> > >>> > code
> > >>> > > complexity too much.
> > >>> > > To prove this, maybe we need to show some small experimental code
> > >>> change.
> > >>> > >
> > >>> > > Best,
> > >>> > > Liya Fan
> > >>> > >
> > >>> > > [1]
> https://arrow.apache.org/docs/format/Columnar.html#logical-types
> > >>> > >
> > >>> > > On Wed, Mar 11, 2020 at 1:53 PM Micah Kornfield <
> > >>> emkornfield@gmail.com>
> > >>> > > wrote:
> > >>> > >
> > >>> > > > Hi Liya Fan,
> > >>> > > > I'm a little concerned that this will change assumptions for at
> > >>> least
> > >>> > > some
> > >>> > > > of the clients using the library (some might always rely on the
> > >>> > validity
> > >>> > > > buffer being present).
> > >>> > > >
> > >>> > > > I think this is a good feature to have for the reasons you
> > >>> mentioned.
> > >>> > It
> > >>> > > > seems like there would need to be some sort of configuration
> bit
> > >>> to set
> > >>> > > for
> > >>> > > > this behavior. But, I'd be worried about code complexity this
> would
> > >>> > > > introduce.
> > >>> > > >
> > >>> > > > Thanks,
> > >>> > > > Micah
> > >>> > > >
> > >>> > > > On Tue, Mar 10, 2020 at 6:42 AM Fan Liya <liya.fan03@gmail.com
> >
> > >>> wrote:
> > >>> > > >
> > >>> > > > > Hi Wes,
> > >>> > > > >
> > >>> > > > > Thanks a lot for your quick reply.
> > >>> > > > > I think what you mentioned is almost exactly what we want to
> do
> > >>> in
> > >>> > > > Java.The
> > >>> > > > > concept is not important.
> > >>> > > > >
> > >>> > > > > Maybe there are only some minor differences:
> > >>> > > > > 1. In C++, the null_count is mutable, while for Java, once a
> > >>> vector
> > >>> > is
> > >>> > > > > constructed as non-nullable, its null count can only be 0.
> > >>> > > > > 2. In C++, a non-nullable array's validity buffer is null,
> while
> > >>> in
> > >>> > > Java,
> > >>> > > > > the buffer is an empty buffer, and cannot be changed.
> > >>> > > > >
> > >>> > > > > Best,
> > >>> > > > > Liya Fan
> > >>> > > > >
> > >>> > > > > On Tue, Mar 10, 2020 at 9:26 PM Wes McKinney <
> > >>> wesmckinn@gmail.com>
> > >>> > > > wrote:
> > >>> > > > >
> > >>> > > > > > hi Liya,
> > >>> > > > > >
> > >>> > > > > > In C++ we elect certain faster code paths when the null
> count
> > >>> is 0
> > >>> > or
> > >>> > > > > > computed to be zero. When the null count is 0, we do not
> > >>> allocate a
> > >>> > > > > > validity bitmap. And there is a "nullable" metadata-only
> flag
> > >>> at
> > >>> > the
> > >>> > > > > > Field level. Could the same kinds of optimizations be
> > >>> implemented
> > >>> > in
> > >>> > > > > > Java without introducing a "nullable" concept?
> > >>> > > > > >
> > >>> > > > > > - Wes
> > >>> > > > > >
> > >>> > > > > > On Tue, Mar 10, 2020 at 8:13 AM Fan Liya <
> liya.fan03@gmail.com
> > >>> >
> > >>> > > wrote:
> > >>> > > > > > >
> > >>> > > > > > > Dear all,
> > >>> > > > > > >
> > >>> > > > > > > A non-nullable vector is one that is guaranteed to
> contain no
> > >>> > > nulls.
> > >>> > > > We
> > >>> > > > > > > want to support non-nullable vectors in Java.
> > >>> > > > > > >
> > >>> > > > > > > *Motivations:*
> > >>> > > > > > > 1. It is widely used in practice. For example, in a
> database
> > >>> > > engine,
> > >>> > > > a
> > >>> > > > > > > column can be declared as not null, so it cannot contain
> null
> > >>> > > values.
> > >>> > > > > > > 2.Non-nullable vectors has significant performance
> advantages
> > >>> > > > compared
> > >>> > > > > > with
> > >>> > > > > > > their nullable conterparts, such as:
> > >>> > > > > > >   1) the memory space of the validity buffer can be
> saved.
> > >>> > > > > > >   2) manipulation of the validity buffer can be bypassed
> > >>> > > > > > >   3) some if-else branches can be replaced by sequential
> > >>> > > instructions
> > >>> > > > > (by
> > >>> > > > > > > the JIT compiler), leading to high throughput for the CPU
> > >>> > pipeline.
> > >>> > > > > > >
> > >>> > > > > > > *Potential Cost:*
> > >>> > > > > > > For nullable vectors, there can be extra checks against
> the
> > >>> > > > > nullablility
> > >>> > > > > > > flag. So we must change the code in a way that minimizes
> the
> > >>> > cost.
> > >>> > > > > > >
> > >>> > > > > > > *Proposed Changes:*
> > >>> > > > > > > 1. There is no need to create new vector classes. We add
> a
> > >>> final
> > >>> > > > > boolean
> > >>> > > > > > to
> > >>> > > > > > > the vector base classes as the nullability flag. The
> value
> > >>> of the
> > >>> > > > flag
> > >>> > > > > > can
> > >>> > > > > > > be obtained from the field when creating the vector.
> > >>> > > > > > > 2. Add a method "boolean isNullable()" to the root
> interface
> > >>> > > > > ValueVector.
> > >>> > > > > > > 3. If a vector is non-nullable, its validity buffer
> should
> > >>> be an
> > >>> > > > empty
> > >>> > > > > > > buffer (not null, so much of the existing logic can be
> left
> > >>> > > > unchanged).
> > >>> > > > > > > 4. For operations involving validity buffers (e.g.
> isNull,
> > >>> get,
> > >>> > > set),
> > >>> > > > > we
> > >>> > > > > > > use the nullability flag to bypass manipulations to the
> > >>> validity
> > >>> > > > > buffer.
> > >>> > > > > > >
> > >>> > > > > > > Therefore, it should be possible to support the feature
> with
> > >>> > small
> > >>> > > > code
> > >>> > > > > > > changes.
> > >>> > > > > > >
> > >>> > > > > > > BTW, please note that similar behaviors have already been
> > >>> > supported
> > >>> > > > in
> > >>> > > > > > C++.
> > >>> > > > > > >
> > >>> > > > > > > Would you please give your valueable feedback?
> > >>> > > > > > >
> > >>> > > > > > > Best,
> > >>> > > > > > > Liya Fan
> > >>> > > > > >
> > >>> > > > >
> > >>> > > >
> > >>> > >
> > >>> >
> > >>>
> > >>
>

Re: [DISCUSS][Java] Support non-nullable vectors

Posted by Wes McKinney <we...@gmail.com>.
I agree with Jacques here. Perhaps what is needed is an unsafe
non-nullable array accessor layer, then there is no need for flags
etc. We've already been writing a lot of such code in C++ (splitting
between no-nulls and some-nulls paths, see also the BitBlockCounter
stuff we've been doing, is such a thing available in Java yet?) and it
hasn't seemed especially burdensome to me.

On Thu, Sep 10, 2020 at 9:37 AM Jacques Nadeau <ja...@apache.org> wrote:
>
> This change is undesirable as it optimizes one path and makes several
> others behave in unintended ways.  What happens if a vector with nulls
> shows up? What happens if a user sets a position to a null value in user
> code when this flag set?
>
> If the answer to the above questions is the use is an advanced user, then
> why can't they just call:
> PlatformDependent.getInt(vector.memoryAddress() + position * 4).
>
> Why would we introduce something directly in the vector class for this
> specialized use case? If the user is advanced, that short memory access
> invocation seems fine to use. The whole idea with Arrow is that if you have
> a specialized algorithm, you can hand write memory reads and writes because
> you have predictable memory layout.
>
> On Wed, Sep 9, 2020 at 1:11 AM Fan Liya <li...@gmail.com> wrote:
>
> > Hi all,
> >
> > Thanks a lot for your previous feedback.
> >
> > Now we have made some investigation and prepared an initial PR supporting
> > the non-nullable IntVector [1], as this represents a common scenario.
> > Some initial observations and conclusions can be made.
> >
> > The basic idea of the PR is to provide a global static final flag
> > (NON_NULLABLE_VECTORS_ENABLED) to enable/disable the new feature.
> > 1. When the flag is enabled, manipulations to the validity buffer would be
> > by-passed if the vector is non-nullable.
> > 2. When the flag is disabled, the behavior is identical to the original
> > code.
> >
> > *First*, we want to show that the change is small. To support the
> > non-nullable IntVector, we need to change classes IntVector,
> > BaseFixedWidthVector (the superclass of IntVector), and BaseValueVector
> > (the superclass of BaseFixedWidthVector). The amount of changes (lines of
> > code) to each class is given below
> >
> > Class
> >
> > Additions (#lines)
> >
> > Deletions (#lines)
> >
> > Total # lines
> >
> > IntVector
> >
> > 12
> >
> > 5
> >
> > 370
> >
> > BaseFixedWidthVector
> >
> > 30
> >
> > 11
> >
> > 925
> >
> > BaseValueVector
> >
> > 7
> >
> > 0
> >
> > 240
> >
> >  It can be seen that the change is small, relative to the class size. In
> > addition, to support additional vector types, we only need to change the
> > sub-classes, and no more need to change the super classes.
> >
> > *Second*, we want to show that the performance improvement is notable. To
> > see this, we give the performance data of the IntBenchmark (with some
> > benchmarks added in the PR). To make the performance as good as possible,
> > we enable the ARROW_ENABLE_UNSAFE_MEMORY_ACCESS flag and disable
> > the ARROW_ENABLE_NULL_CHECK_FOR_GET flag. Below, we give the data with the
> > non-nullable vector flag turned off and on, respectively.
> >
> >
> >
> >
> >
> >
> >
> >
> > *(Vector non-nullable flag off)Benchmark                         Mode  Cnt
> >      Score     Error  UnitsIntBenchmarks.getInt              avgt    5
> >  384.948 ±   3.336  ns/opIntBenchmarks.isNull              avgt    5
> > 1301.005 ±  54.239  ns/opIntBenchmarks.setIntDirectly      avgt    5
> >  15387.486 ± 555.749  ns/opIntBenchmarks.setWithValueHolder  avgt    5
> >  15251.351 ± 134.286  ns/opIntBenchmarks.setWithWriter       avgt    5
> >  28595.586 ± 932.528  ns/op*
> >
> > *(Vector non-nullable flag on)*
> >
> >
> >
> >
> > *IntBenchmarks.getInt              avgt    5    384.630 ±    1.547
> >  ns/opIntBenchmarks.isNull              avgt    5      3.004 ±    0.110
> >  ns/opIntBenchmarks.setIntDirectly      avgt    5  13511.605 ±  135.455
> >  ns/opIntBenchmarks.setWithValueHolder  avgt    5  13035.883 ±  196.081
> >  ns/opIntBenchmarks.setWithWriter       avgt    5  24734.825 ± 1603.708
> >  ns/op*
> >
> > For the *getInt *operation, there is little performance difference. This
> > is because we disable the  ARROW_ENABLE_NULL_CHECK_FOR_GET flag, so
> > manipulations to the validity buffer are by-passed, even if the
> > non-nullable vector flag is off. When the  ARROW_ENABLE_NULL_CHECK_FOR_GET
> > is enabled, there is a 72.4% performance improvement gained by turning on
> > the non-nullable vector flag.
> >
> > For the isNull operation, we see 3 orders of magnitude performance
> > improvements by enabling the non-nullable vector flag.
> >
> > For other operations, we see 12.2%, 13.5% and 14.5% performance
> > improvements by turning on the non-nullable vector flag.
> >
> > So it can be seen that notable performance improvements can be gained for
> > non-nullable vectors.
> >
> > *Third*, we want to claim that for nullable vectors and scenarios when we
> > turn off the non-nullable vector flag, the new changes do not introduce
> > performance regression. Such concern is plausible, as our changes add some
> > if-else branches to the code, which may degrade performance.
> >
> > We give the benchmark results of the original code, as below.
> >
> > *(original ocde)*
> >
> >
> >
> >
> >
> > *Benchmark                         Mode  Cnt      Score     Error
> >  UnitsIntBenchmarks.getInt              avgt    5    383.511 ±   0.156
> >  ns/opIntBenchmarks.isNull              avgt    5   1274.271 ±  19.092
> >  ns/opIntBenchmarks.setIntDirectly      avgt    5  15162.219 ± 194.956
> >  ns/opIntBenchmarks.setWithValueHolder  avgt    5  15247.640 ± 153.103
> >  ns/opIntBenchmarks.setWithWriter       avgt    5  28587.780 ± 160.458
> >  ns/op*
> >
> > By comparing this set of results with the above results with non-nullable
> > vector flag disabled, little performance difference can be observed,
> > indicating no performance regression.
> >
> > By examining the generated assembly, it can be seen that JIT is smart
> > enough to remove the if-else branch completely (the below screen-shot gives
> > an example where the if branch is optimized away in the assembly)
> >
> > So if some users do not like this feature, because their vectors are
> > always nullable, they can simply disable the flag, and no performance
> > difference can be observed.
> >
> > Would you please give some feedback?
> > If it looks good to you, maybe we can go ahead to support other types of
> > vectors.
> > Thank you in advance.
> >
> > Liya Fan
> >
> > [image: image.png]
> >
> > [1] https://github.com/apache/arrow/pull/8147
> >
> > On Fri, Mar 13, 2020 at 9:47 PM Fan Liya <li...@gmail.com> wrote:
> >
> >> Hi Jacques,
> >>
> >> Thanks a lot for your valuable comments.
> >>
> >> I agree with you that collapsing nullable and non-nullable
> >> implementations is a good idea, and it does not contradict with the idea of
> >> introducing a fast code path, if it does not introduce much cost or code
> >> complexity.
> >>
> >> The idea of word level checking is interesting.
> >>
> >> As you suggested, we will do more investigations and reconsider the
> >> problem from a broader perspective.
> >>
> >> Best,
> >> Liya Fan
> >>
> >>
> >>
> >> On Thu, Mar 12, 2020 at 9:27 AM Jacques Nadeau <ja...@apache.org>
> >> wrote:
> >>
> >>> Generally Ive found that this isnt an important optimization in the use
> >>> cases we see. Memory overhead, especially with our Java shared allocation
> >>> scheme is nominal. Optimizing null checks at the word level usually is
> >>> much
> >>> more impactful since non null and null runs are much more common on a
> >>> shorter window common than they seeing those declared.
> >>>
> >>> In other words, I'd suggest you look at your problem with a broader
> >>> perspective and see whether you're actually focused on optimizing the
> >>> most
> >>> important dimension.
> >>>
> >>> As an aside, the original Arrow Java code actually treated these concepts
> >>> more distinctly and we consciously made a decision to collapse them to
> >>> simplify real world use.
> >>>
> >>> I do think it makes to add a dirty read interface if you want. This would
> >>> allow consumers of the interface to behave efficiently if they wanted to.
> >>>
> >>> One last note, efficient evaluation and processing should generally
> >>> always
> >>> work at the validity word level. Adding an extra if check at the word
> >>> versus extra complexity of having an early out per batch seems like a
> >>> pretty small life in the grand scheme of processing.
> >>>
> >>> On Wed, Mar 11, 2020, 9:15 AM Brian Hulette <hu...@gmail.com> wrote:
> >>>
> >>> > > And there is a "nullable" metadata-only flag at the
> >>> > > Field level. Could the same kinds of optimizations be implemented in
> >>> > > Java without introducing a "nullable" concept?
> >>> >
> >>> > Note Liya Fan did suggest pulling the nullable flag from the Field
> >>> when the
> >>> > vector is created in item (1) of the proposed changes.
> >>> >
> >>> > Brian
> >>> >
> >>> > On Wed, Mar 11, 2020 at 5:54 AM Fan Liya <li...@gmail.com> wrote:
> >>> >
> >>> > > Hi Micah,
> >>> > >
> >>> > > Thanks a lot for your valuable comments. Please see my comments
> >>> inline.
> >>> > >
> >>> > > > I'm a little concerned that this will change assumptions for at
> >>> least
> >>> > > some
> >>> > > > of the clients using the library (some might always rely on the
> >>> > validity
> >>> > > > buffer being present).
> >>> > >
> >>> > > I can understand your concern and I am also concerned.
> >>> > > IMO, the client should not depend on this assumption, as the
> >>> > specification
> >>> > > says "Arrays having a 0 null count may choose to not allocate the
> >>> > validity
> >>> > > bitmap." [1]
> >>> > > That being said, I think it would be safe to provide a global flag to
> >>> > > switch on/off the feature (as you suggested).
> >>> > >
> >>> > > > I think this is a good feature to have for the reasons you
> >>> mentioned.
> >>> > It
> >>> > > > seems like there would need to be some sort of configuration bit
> >>> to set
> >>> > > for
> >>> > > > this behavior.
> >>> > >
> >>> > > Good suggestion. We should be able to switch on and off the feature
> >>> with
> >>> > a
> >>> > > single global flag.
> >>> > >
> >>> > > > But, I'd be worried about code complexity this would
> >>> > > > introduce.
> >>> > >
> >>> > > I agree with you that code complexity is an important factor to
> >>> consider.
> >>> > > IMO, our proposal should not involve too much code change, or
> >>> increase
> >>> > code
> >>> > > complexity too much.
> >>> > > To prove this, maybe we need to show some small experimental code
> >>> change.
> >>> > >
> >>> > > Best,
> >>> > > Liya Fan
> >>> > >
> >>> > > [1] https://arrow.apache.org/docs/format/Columnar.html#logical-types
> >>> > >
> >>> > > On Wed, Mar 11, 2020 at 1:53 PM Micah Kornfield <
> >>> emkornfield@gmail.com>
> >>> > > wrote:
> >>> > >
> >>> > > > Hi Liya Fan,
> >>> > > > I'm a little concerned that this will change assumptions for at
> >>> least
> >>> > > some
> >>> > > > of the clients using the library (some might always rely on the
> >>> > validity
> >>> > > > buffer being present).
> >>> > > >
> >>> > > > I think this is a good feature to have for the reasons you
> >>> mentioned.
> >>> > It
> >>> > > > seems like there would need to be some sort of configuration bit
> >>> to set
> >>> > > for
> >>> > > > this behavior. But, I'd be worried about code complexity this would
> >>> > > > introduce.
> >>> > > >
> >>> > > > Thanks,
> >>> > > > Micah
> >>> > > >
> >>> > > > On Tue, Mar 10, 2020 at 6:42 AM Fan Liya <li...@gmail.com>
> >>> wrote:
> >>> > > >
> >>> > > > > Hi Wes,
> >>> > > > >
> >>> > > > > Thanks a lot for your quick reply.
> >>> > > > > I think what you mentioned is almost exactly what we want to do
> >>> in
> >>> > > > Java.The
> >>> > > > > concept is not important.
> >>> > > > >
> >>> > > > > Maybe there are only some minor differences:
> >>> > > > > 1. In C++, the null_count is mutable, while for Java, once a
> >>> vector
> >>> > is
> >>> > > > > constructed as non-nullable, its null count can only be 0.
> >>> > > > > 2. In C++, a non-nullable array's validity buffer is null, while
> >>> in
> >>> > > Java,
> >>> > > > > the buffer is an empty buffer, and cannot be changed.
> >>> > > > >
> >>> > > > > Best,
> >>> > > > > Liya Fan
> >>> > > > >
> >>> > > > > On Tue, Mar 10, 2020 at 9:26 PM Wes McKinney <
> >>> wesmckinn@gmail.com>
> >>> > > > wrote:
> >>> > > > >
> >>> > > > > > hi Liya,
> >>> > > > > >
> >>> > > > > > In C++ we elect certain faster code paths when the null count
> >>> is 0
> >>> > or
> >>> > > > > > computed to be zero. When the null count is 0, we do not
> >>> allocate a
> >>> > > > > > validity bitmap. And there is a "nullable" metadata-only flag
> >>> at
> >>> > the
> >>> > > > > > Field level. Could the same kinds of optimizations be
> >>> implemented
> >>> > in
> >>> > > > > > Java without introducing a "nullable" concept?
> >>> > > > > >
> >>> > > > > > - Wes
> >>> > > > > >
> >>> > > > > > On Tue, Mar 10, 2020 at 8:13 AM Fan Liya <liya.fan03@gmail.com
> >>> >
> >>> > > wrote:
> >>> > > > > > >
> >>> > > > > > > Dear all,
> >>> > > > > > >
> >>> > > > > > > A non-nullable vector is one that is guaranteed to contain no
> >>> > > nulls.
> >>> > > > We
> >>> > > > > > > want to support non-nullable vectors in Java.
> >>> > > > > > >
> >>> > > > > > > *Motivations:*
> >>> > > > > > > 1. It is widely used in practice. For example, in a database
> >>> > > engine,
> >>> > > > a
> >>> > > > > > > column can be declared as not null, so it cannot contain null
> >>> > > values.
> >>> > > > > > > 2.Non-nullable vectors has significant performance advantages
> >>> > > > compared
> >>> > > > > > with
> >>> > > > > > > their nullable conterparts, such as:
> >>> > > > > > >   1) the memory space of the validity buffer can be saved.
> >>> > > > > > >   2) manipulation of the validity buffer can be bypassed
> >>> > > > > > >   3) some if-else branches can be replaced by sequential
> >>> > > instructions
> >>> > > > > (by
> >>> > > > > > > the JIT compiler), leading to high throughput for the CPU
> >>> > pipeline.
> >>> > > > > > >
> >>> > > > > > > *Potential Cost:*
> >>> > > > > > > For nullable vectors, there can be extra checks against the
> >>> > > > > nullablility
> >>> > > > > > > flag. So we must change the code in a way that minimizes the
> >>> > cost.
> >>> > > > > > >
> >>> > > > > > > *Proposed Changes:*
> >>> > > > > > > 1. There is no need to create new vector classes. We add a
> >>> final
> >>> > > > > boolean
> >>> > > > > > to
> >>> > > > > > > the vector base classes as the nullability flag. The value
> >>> of the
> >>> > > > flag
> >>> > > > > > can
> >>> > > > > > > be obtained from the field when creating the vector.
> >>> > > > > > > 2. Add a method "boolean isNullable()" to the root interface
> >>> > > > > ValueVector.
> >>> > > > > > > 3. If a vector is non-nullable, its validity buffer should
> >>> be an
> >>> > > > empty
> >>> > > > > > > buffer (not null, so much of the existing logic can be left
> >>> > > > unchanged).
> >>> > > > > > > 4. For operations involving validity buffers (e.g. isNull,
> >>> get,
> >>> > > set),
> >>> > > > > we
> >>> > > > > > > use the nullability flag to bypass manipulations to the
> >>> validity
> >>> > > > > buffer.
> >>> > > > > > >
> >>> > > > > > > Therefore, it should be possible to support the feature with
> >>> > small
> >>> > > > code
> >>> > > > > > > changes.
> >>> > > > > > >
> >>> > > > > > > BTW, please note that similar behaviors have already been
> >>> > supported
> >>> > > > in
> >>> > > > > > C++.
> >>> > > > > > >
> >>> > > > > > > Would you please give your valueable feedback?
> >>> > > > > > >
> >>> > > > > > > Best,
> >>> > > > > > > Liya Fan
> >>> > > > > >
> >>> > > > >
> >>> > > >
> >>> > >
> >>> >
> >>>
> >>

Re: [DISCUSS][Java] Support non-nullable vectors

Posted by Jacques Nadeau <ja...@apache.org>.
This change is undesirable as it optimizes one path and makes several
others behave in unintended ways.  What happens if a vector with nulls
shows up? What happens if a user sets a position to a null value in user
code when this flag set?

If the answer to the above questions is the use is an advanced user, then
why can't they just call:
PlatformDependent.getInt(vector.memoryAddress() + position * 4).

Why would we introduce something directly in the vector class for this
specialized use case? If the user is advanced, that short memory access
invocation seems fine to use. The whole idea with Arrow is that if you have
a specialized algorithm, you can hand write memory reads and writes because
you have predictable memory layout.

On Wed, Sep 9, 2020 at 1:11 AM Fan Liya <li...@gmail.com> wrote:

> Hi all,
>
> Thanks a lot for your previous feedback.
>
> Now we have made some investigation and prepared an initial PR supporting
> the non-nullable IntVector [1], as this represents a common scenario.
> Some initial observations and conclusions can be made.
>
> The basic idea of the PR is to provide a global static final flag
> (NON_NULLABLE_VECTORS_ENABLED) to enable/disable the new feature.
> 1. When the flag is enabled, manipulations to the validity buffer would be
> by-passed if the vector is non-nullable.
> 2. When the flag is disabled, the behavior is identical to the original
> code.
>
> *First*, we want to show that the change is small. To support the
> non-nullable IntVector, we need to change classes IntVector,
> BaseFixedWidthVector (the superclass of IntVector), and BaseValueVector
> (the superclass of BaseFixedWidthVector). The amount of changes (lines of
> code) to each class is given below
>
> Class
>
> Additions (#lines)
>
> Deletions (#lines)
>
> Total # lines
>
> IntVector
>
> 12
>
> 5
>
> 370
>
> BaseFixedWidthVector
>
> 30
>
> 11
>
> 925
>
> BaseValueVector
>
> 7
>
> 0
>
> 240
>
>  It can be seen that the change is small, relative to the class size. In
> addition, to support additional vector types, we only need to change the
> sub-classes, and no more need to change the super classes.
>
> *Second*, we want to show that the performance improvement is notable. To
> see this, we give the performance data of the IntBenchmark (with some
> benchmarks added in the PR). To make the performance as good as possible,
> we enable the ARROW_ENABLE_UNSAFE_MEMORY_ACCESS flag and disable
> the ARROW_ENABLE_NULL_CHECK_FOR_GET flag. Below, we give the data with the
> non-nullable vector flag turned off and on, respectively.
>
>
>
>
>
>
>
>
> *(Vector non-nullable flag off)Benchmark                         Mode  Cnt
>      Score     Error  UnitsIntBenchmarks.getInt              avgt    5
>  384.948 ±   3.336  ns/opIntBenchmarks.isNull              avgt    5
> 1301.005 ±  54.239  ns/opIntBenchmarks.setIntDirectly      avgt    5
>  15387.486 ± 555.749  ns/opIntBenchmarks.setWithValueHolder  avgt    5
>  15251.351 ± 134.286  ns/opIntBenchmarks.setWithWriter       avgt    5
>  28595.586 ± 932.528  ns/op*
>
> *(Vector non-nullable flag on)*
>
>
>
>
> *IntBenchmarks.getInt              avgt    5    384.630 ±    1.547
>  ns/opIntBenchmarks.isNull              avgt    5      3.004 ±    0.110
>  ns/opIntBenchmarks.setIntDirectly      avgt    5  13511.605 ±  135.455
>  ns/opIntBenchmarks.setWithValueHolder  avgt    5  13035.883 ±  196.081
>  ns/opIntBenchmarks.setWithWriter       avgt    5  24734.825 ± 1603.708
>  ns/op*
>
> For the *getInt *operation, there is little performance difference. This
> is because we disable the  ARROW_ENABLE_NULL_CHECK_FOR_GET flag, so
> manipulations to the validity buffer are by-passed, even if the
> non-nullable vector flag is off. When the  ARROW_ENABLE_NULL_CHECK_FOR_GET
> is enabled, there is a 72.4% performance improvement gained by turning on
> the non-nullable vector flag.
>
> For the isNull operation, we see 3 orders of magnitude performance
> improvements by enabling the non-nullable vector flag.
>
> For other operations, we see 12.2%, 13.5% and 14.5% performance
> improvements by turning on the non-nullable vector flag.
>
> So it can be seen that notable performance improvements can be gained for
> non-nullable vectors.
>
> *Third*, we want to claim that for nullable vectors and scenarios when we
> turn off the non-nullable vector flag, the new changes do not introduce
> performance regression. Such concern is plausible, as our changes add some
> if-else branches to the code, which may degrade performance.
>
> We give the benchmark results of the original code, as below.
>
> *(original ocde)*
>
>
>
>
>
> *Benchmark                         Mode  Cnt      Score     Error
>  UnitsIntBenchmarks.getInt              avgt    5    383.511 ±   0.156
>  ns/opIntBenchmarks.isNull              avgt    5   1274.271 ±  19.092
>  ns/opIntBenchmarks.setIntDirectly      avgt    5  15162.219 ± 194.956
>  ns/opIntBenchmarks.setWithValueHolder  avgt    5  15247.640 ± 153.103
>  ns/opIntBenchmarks.setWithWriter       avgt    5  28587.780 ± 160.458
>  ns/op*
>
> By comparing this set of results with the above results with non-nullable
> vector flag disabled, little performance difference can be observed,
> indicating no performance regression.
>
> By examining the generated assembly, it can be seen that JIT is smart
> enough to remove the if-else branch completely (the below screen-shot gives
> an example where the if branch is optimized away in the assembly)
>
> So if some users do not like this feature, because their vectors are
> always nullable, they can simply disable the flag, and no performance
> difference can be observed.
>
> Would you please give some feedback?
> If it looks good to you, maybe we can go ahead to support other types of
> vectors.
> Thank you in advance.
>
> Liya Fan
>
> [image: image.png]
>
> [1] https://github.com/apache/arrow/pull/8147
>
> On Fri, Mar 13, 2020 at 9:47 PM Fan Liya <li...@gmail.com> wrote:
>
>> Hi Jacques,
>>
>> Thanks a lot for your valuable comments.
>>
>> I agree with you that collapsing nullable and non-nullable
>> implementations is a good idea, and it does not contradict with the idea of
>> introducing a fast code path, if it does not introduce much cost or code
>> complexity.
>>
>> The idea of word level checking is interesting.
>>
>> As you suggested, we will do more investigations and reconsider the
>> problem from a broader perspective.
>>
>> Best,
>> Liya Fan
>>
>>
>>
>> On Thu, Mar 12, 2020 at 9:27 AM Jacques Nadeau <ja...@apache.org>
>> wrote:
>>
>>> Generally Ive found that this isnt an important optimization in the use
>>> cases we see. Memory overhead, especially with our Java shared allocation
>>> scheme is nominal. Optimizing null checks at the word level usually is
>>> much
>>> more impactful since non null and null runs are much more common on a
>>> shorter window common than they seeing those declared.
>>>
>>> In other words, I'd suggest you look at your problem with a broader
>>> perspective and see whether you're actually focused on optimizing the
>>> most
>>> important dimension.
>>>
>>> As an aside, the original Arrow Java code actually treated these concepts
>>> more distinctly and we consciously made a decision to collapse them to
>>> simplify real world use.
>>>
>>> I do think it makes to add a dirty read interface if you want. This would
>>> allow consumers of the interface to behave efficiently if they wanted to.
>>>
>>> One last note, efficient evaluation and processing should generally
>>> always
>>> work at the validity word level. Adding an extra if check at the word
>>> versus extra complexity of having an early out per batch seems like a
>>> pretty small life in the grand scheme of processing.
>>>
>>> On Wed, Mar 11, 2020, 9:15 AM Brian Hulette <hu...@gmail.com> wrote:
>>>
>>> > > And there is a "nullable" metadata-only flag at the
>>> > > Field level. Could the same kinds of optimizations be implemented in
>>> > > Java without introducing a "nullable" concept?
>>> >
>>> > Note Liya Fan did suggest pulling the nullable flag from the Field
>>> when the
>>> > vector is created in item (1) of the proposed changes.
>>> >
>>> > Brian
>>> >
>>> > On Wed, Mar 11, 2020 at 5:54 AM Fan Liya <li...@gmail.com> wrote:
>>> >
>>> > > Hi Micah,
>>> > >
>>> > > Thanks a lot for your valuable comments. Please see my comments
>>> inline.
>>> > >
>>> > > > I'm a little concerned that this will change assumptions for at
>>> least
>>> > > some
>>> > > > of the clients using the library (some might always rely on the
>>> > validity
>>> > > > buffer being present).
>>> > >
>>> > > I can understand your concern and I am also concerned.
>>> > > IMO, the client should not depend on this assumption, as the
>>> > specification
>>> > > says "Arrays having a 0 null count may choose to not allocate the
>>> > validity
>>> > > bitmap." [1]
>>> > > That being said, I think it would be safe to provide a global flag to
>>> > > switch on/off the feature (as you suggested).
>>> > >
>>> > > > I think this is a good feature to have for the reasons you
>>> mentioned.
>>> > It
>>> > > > seems like there would need to be some sort of configuration bit
>>> to set
>>> > > for
>>> > > > this behavior.
>>> > >
>>> > > Good suggestion. We should be able to switch on and off the feature
>>> with
>>> > a
>>> > > single global flag.
>>> > >
>>> > > > But, I'd be worried about code complexity this would
>>> > > > introduce.
>>> > >
>>> > > I agree with you that code complexity is an important factor to
>>> consider.
>>> > > IMO, our proposal should not involve too much code change, or
>>> increase
>>> > code
>>> > > complexity too much.
>>> > > To prove this, maybe we need to show some small experimental code
>>> change.
>>> > >
>>> > > Best,
>>> > > Liya Fan
>>> > >
>>> > > [1] https://arrow.apache.org/docs/format/Columnar.html#logical-types
>>> > >
>>> > > On Wed, Mar 11, 2020 at 1:53 PM Micah Kornfield <
>>> emkornfield@gmail.com>
>>> > > wrote:
>>> > >
>>> > > > Hi Liya Fan,
>>> > > > I'm a little concerned that this will change assumptions for at
>>> least
>>> > > some
>>> > > > of the clients using the library (some might always rely on the
>>> > validity
>>> > > > buffer being present).
>>> > > >
>>> > > > I think this is a good feature to have for the reasons you
>>> mentioned.
>>> > It
>>> > > > seems like there would need to be some sort of configuration bit
>>> to set
>>> > > for
>>> > > > this behavior. But, I'd be worried about code complexity this would
>>> > > > introduce.
>>> > > >
>>> > > > Thanks,
>>> > > > Micah
>>> > > >
>>> > > > On Tue, Mar 10, 2020 at 6:42 AM Fan Liya <li...@gmail.com>
>>> wrote:
>>> > > >
>>> > > > > Hi Wes,
>>> > > > >
>>> > > > > Thanks a lot for your quick reply.
>>> > > > > I think what you mentioned is almost exactly what we want to do
>>> in
>>> > > > Java.The
>>> > > > > concept is not important.
>>> > > > >
>>> > > > > Maybe there are only some minor differences:
>>> > > > > 1. In C++, the null_count is mutable, while for Java, once a
>>> vector
>>> > is
>>> > > > > constructed as non-nullable, its null count can only be 0.
>>> > > > > 2. In C++, a non-nullable array's validity buffer is null, while
>>> in
>>> > > Java,
>>> > > > > the buffer is an empty buffer, and cannot be changed.
>>> > > > >
>>> > > > > Best,
>>> > > > > Liya Fan
>>> > > > >
>>> > > > > On Tue, Mar 10, 2020 at 9:26 PM Wes McKinney <
>>> wesmckinn@gmail.com>
>>> > > > wrote:
>>> > > > >
>>> > > > > > hi Liya,
>>> > > > > >
>>> > > > > > In C++ we elect certain faster code paths when the null count
>>> is 0
>>> > or
>>> > > > > > computed to be zero. When the null count is 0, we do not
>>> allocate a
>>> > > > > > validity bitmap. And there is a "nullable" metadata-only flag
>>> at
>>> > the
>>> > > > > > Field level. Could the same kinds of optimizations be
>>> implemented
>>> > in
>>> > > > > > Java without introducing a "nullable" concept?
>>> > > > > >
>>> > > > > > - Wes
>>> > > > > >
>>> > > > > > On Tue, Mar 10, 2020 at 8:13 AM Fan Liya <liya.fan03@gmail.com
>>> >
>>> > > wrote:
>>> > > > > > >
>>> > > > > > > Dear all,
>>> > > > > > >
>>> > > > > > > A non-nullable vector is one that is guaranteed to contain no
>>> > > nulls.
>>> > > > We
>>> > > > > > > want to support non-nullable vectors in Java.
>>> > > > > > >
>>> > > > > > > *Motivations:*
>>> > > > > > > 1. It is widely used in practice. For example, in a database
>>> > > engine,
>>> > > > a
>>> > > > > > > column can be declared as not null, so it cannot contain null
>>> > > values.
>>> > > > > > > 2.Non-nullable vectors has significant performance advantages
>>> > > > compared
>>> > > > > > with
>>> > > > > > > their nullable conterparts, such as:
>>> > > > > > >   1) the memory space of the validity buffer can be saved.
>>> > > > > > >   2) manipulation of the validity buffer can be bypassed
>>> > > > > > >   3) some if-else branches can be replaced by sequential
>>> > > instructions
>>> > > > > (by
>>> > > > > > > the JIT compiler), leading to high throughput for the CPU
>>> > pipeline.
>>> > > > > > >
>>> > > > > > > *Potential Cost:*
>>> > > > > > > For nullable vectors, there can be extra checks against the
>>> > > > > nullablility
>>> > > > > > > flag. So we must change the code in a way that minimizes the
>>> > cost.
>>> > > > > > >
>>> > > > > > > *Proposed Changes:*
>>> > > > > > > 1. There is no need to create new vector classes. We add a
>>> final
>>> > > > > boolean
>>> > > > > > to
>>> > > > > > > the vector base classes as the nullability flag. The value
>>> of the
>>> > > > flag
>>> > > > > > can
>>> > > > > > > be obtained from the field when creating the vector.
>>> > > > > > > 2. Add a method "boolean isNullable()" to the root interface
>>> > > > > ValueVector.
>>> > > > > > > 3. If a vector is non-nullable, its validity buffer should
>>> be an
>>> > > > empty
>>> > > > > > > buffer (not null, so much of the existing logic can be left
>>> > > > unchanged).
>>> > > > > > > 4. For operations involving validity buffers (e.g. isNull,
>>> get,
>>> > > set),
>>> > > > > we
>>> > > > > > > use the nullability flag to bypass manipulations to the
>>> validity
>>> > > > > buffer.
>>> > > > > > >
>>> > > > > > > Therefore, it should be possible to support the feature with
>>> > small
>>> > > > code
>>> > > > > > > changes.
>>> > > > > > >
>>> > > > > > > BTW, please note that similar behaviors have already been
>>> > supported
>>> > > > in
>>> > > > > > C++.
>>> > > > > > >
>>> > > > > > > Would you please give your valueable feedback?
>>> > > > > > >
>>> > > > > > > Best,
>>> > > > > > > Liya Fan
>>> > > > > >
>>> > > > >
>>> > > >
>>> > >
>>> >
>>>
>>

Re: [DISCUSS][Java] Support non-nullable vectors

Posted by Fan Liya <li...@gmail.com>.
Hi all,

Thanks a lot for your previous feedback.

Now we have made some investigation and prepared an initial PR supporting
the non-nullable IntVector [1], as this represents a common scenario.
Some initial observations and conclusions can be made.

The basic idea of the PR is to provide a global static final flag
(NON_NULLABLE_VECTORS_ENABLED) to enable/disable the new feature.
1. When the flag is enabled, manipulations to the validity buffer would be
by-passed if the vector is non-nullable.
2. When the flag is disabled, the behavior is identical to the original
code.

*First*, we want to show that the change is small. To support the
non-nullable IntVector, we need to change classes IntVector,
BaseFixedWidthVector (the superclass of IntVector), and BaseValueVector
(the superclass of BaseFixedWidthVector). The amount of changes (lines of
code) to each class is given below

Class

Additions (#lines)

Deletions (#lines)

Total # lines

IntVector

12

5

370

BaseFixedWidthVector

30

11

925

BaseValueVector

7

0

240

 It can be seen that the change is small, relative to the class size. In
addition, to support additional vector types, we only need to change the
sub-classes, and no more need to change the super classes.

*Second*, we want to show that the performance improvement is notable. To
see this, we give the performance data of the IntBenchmark (with some
benchmarks added in the PR). To make the performance as good as possible,
we enable the ARROW_ENABLE_UNSAFE_MEMORY_ACCESS flag and disable
the ARROW_ENABLE_NULL_CHECK_FOR_GET flag. Below, we give the data with the
non-nullable vector flag turned off and on, respectively.








*(Vector non-nullable flag off)Benchmark                         Mode  Cnt
     Score     Error  UnitsIntBenchmarks.getInt              avgt    5
 384.948 ±   3.336  ns/opIntBenchmarks.isNull              avgt    5
1301.005 ±  54.239  ns/opIntBenchmarks.setIntDirectly      avgt    5
 15387.486 ± 555.749  ns/opIntBenchmarks.setWithValueHolder  avgt    5
 15251.351 ± 134.286  ns/opIntBenchmarks.setWithWriter       avgt    5
 28595.586 ± 932.528  ns/op*

*(Vector non-nullable flag on)*




*IntBenchmarks.getInt              avgt    5    384.630 ±    1.547
 ns/opIntBenchmarks.isNull              avgt    5      3.004 ±    0.110
 ns/opIntBenchmarks.setIntDirectly      avgt    5  13511.605 ±  135.455
 ns/opIntBenchmarks.setWithValueHolder  avgt    5  13035.883 ±  196.081
 ns/opIntBenchmarks.setWithWriter       avgt    5  24734.825 ± 1603.708
 ns/op*

For the *getInt *operation, there is little performance difference. This is
because we disable the  ARROW_ENABLE_NULL_CHECK_FOR_GET flag, so
manipulations to the validity buffer are by-passed, even if the
non-nullable vector flag is off. When the  ARROW_ENABLE_NULL_CHECK_FOR_GET
is enabled, there is a 72.4% performance improvement gained by turning on
the non-nullable vector flag.

For the isNull operation, we see 3 orders of magnitude performance
improvements by enabling the non-nullable vector flag.

For other operations, we see 12.2%, 13.5% and 14.5% performance
improvements by turning on the non-nullable vector flag.

So it can be seen that notable performance improvements can be gained for
non-nullable vectors.

*Third*, we want to claim that for nullable vectors and scenarios when we
turn off the non-nullable vector flag, the new changes do not introduce
performance regression. Such concern is plausible, as our changes add some
if-else branches to the code, which may degrade performance.

We give the benchmark results of the original code, as below.

*(original ocde)*





*Benchmark                         Mode  Cnt      Score     Error
 UnitsIntBenchmarks.getInt              avgt    5    383.511 ±   0.156
 ns/opIntBenchmarks.isNull              avgt    5   1274.271 ±  19.092
 ns/opIntBenchmarks.setIntDirectly      avgt    5  15162.219 ± 194.956
 ns/opIntBenchmarks.setWithValueHolder  avgt    5  15247.640 ± 153.103
 ns/opIntBenchmarks.setWithWriter       avgt    5  28587.780 ± 160.458
 ns/op*

By comparing this set of results with the above results with non-nullable
vector flag disabled, little performance difference can be observed,
indicating no performance regression.

By examining the generated assembly, it can be seen that JIT is smart
enough to remove the if-else branch completely (the below screen-shot gives
an example where the if branch is optimized away in the assembly)

So if some users do not like this feature, because their vectors are always
nullable, they can simply disable the flag, and no performance difference
can be observed.

Would you please give some feedback?
If it looks good to you, maybe we can go ahead to support other types of
vectors.
Thank you in advance.

Liya Fan

[image: image.png]

[1] https://github.com/apache/arrow/pull/8147

On Fri, Mar 13, 2020 at 9:47 PM Fan Liya <li...@gmail.com> wrote:

> Hi Jacques,
>
> Thanks a lot for your valuable comments.
>
> I agree with you that collapsing nullable and non-nullable implementations
> is a good idea, and it does not contradict with the idea of introducing a
> fast code path, if it does not introduce much cost or code complexity.
>
> The idea of word level checking is interesting.
>
> As you suggested, we will do more investigations and reconsider the
> problem from a broader perspective.
>
> Best,
> Liya Fan
>
>
>
> On Thu, Mar 12, 2020 at 9:27 AM Jacques Nadeau <ja...@apache.org> wrote:
>
>> Generally Ive found that this isnt an important optimization in the use
>> cases we see. Memory overhead, especially with our Java shared allocation
>> scheme is nominal. Optimizing null checks at the word level usually is
>> much
>> more impactful since non null and null runs are much more common on a
>> shorter window common than they seeing those declared.
>>
>> In other words, I'd suggest you look at your problem with a broader
>> perspective and see whether you're actually focused on optimizing the most
>> important dimension.
>>
>> As an aside, the original Arrow Java code actually treated these concepts
>> more distinctly and we consciously made a decision to collapse them to
>> simplify real world use.
>>
>> I do think it makes to add a dirty read interface if you want. This would
>> allow consumers of the interface to behave efficiently if they wanted to.
>>
>> One last note, efficient evaluation and processing should generally always
>> work at the validity word level. Adding an extra if check at the word
>> versus extra complexity of having an early out per batch seems like a
>> pretty small life in the grand scheme of processing.
>>
>> On Wed, Mar 11, 2020, 9:15 AM Brian Hulette <hu...@gmail.com> wrote:
>>
>> > > And there is a "nullable" metadata-only flag at the
>> > > Field level. Could the same kinds of optimizations be implemented in
>> > > Java without introducing a "nullable" concept?
>> >
>> > Note Liya Fan did suggest pulling the nullable flag from the Field when
>> the
>> > vector is created in item (1) of the proposed changes.
>> >
>> > Brian
>> >
>> > On Wed, Mar 11, 2020 at 5:54 AM Fan Liya <li...@gmail.com> wrote:
>> >
>> > > Hi Micah,
>> > >
>> > > Thanks a lot for your valuable comments. Please see my comments
>> inline.
>> > >
>> > > > I'm a little concerned that this will change assumptions for at
>> least
>> > > some
>> > > > of the clients using the library (some might always rely on the
>> > validity
>> > > > buffer being present).
>> > >
>> > > I can understand your concern and I am also concerned.
>> > > IMO, the client should not depend on this assumption, as the
>> > specification
>> > > says "Arrays having a 0 null count may choose to not allocate the
>> > validity
>> > > bitmap." [1]
>> > > That being said, I think it would be safe to provide a global flag to
>> > > switch on/off the feature (as you suggested).
>> > >
>> > > > I think this is a good feature to have for the reasons you
>> mentioned.
>> > It
>> > > > seems like there would need to be some sort of configuration bit to
>> set
>> > > for
>> > > > this behavior.
>> > >
>> > > Good suggestion. We should be able to switch on and off the feature
>> with
>> > a
>> > > single global flag.
>> > >
>> > > > But, I'd be worried about code complexity this would
>> > > > introduce.
>> > >
>> > > I agree with you that code complexity is an important factor to
>> consider.
>> > > IMO, our proposal should not involve too much code change, or increase
>> > code
>> > > complexity too much.
>> > > To prove this, maybe we need to show some small experimental code
>> change.
>> > >
>> > > Best,
>> > > Liya Fan
>> > >
>> > > [1] https://arrow.apache.org/docs/format/Columnar.html#logical-types
>> > >
>> > > On Wed, Mar 11, 2020 at 1:53 PM Micah Kornfield <
>> emkornfield@gmail.com>
>> > > wrote:
>> > >
>> > > > Hi Liya Fan,
>> > > > I'm a little concerned that this will change assumptions for at
>> least
>> > > some
>> > > > of the clients using the library (some might always rely on the
>> > validity
>> > > > buffer being present).
>> > > >
>> > > > I think this is a good feature to have for the reasons you
>> mentioned.
>> > It
>> > > > seems like there would need to be some sort of configuration bit to
>> set
>> > > for
>> > > > this behavior. But, I'd be worried about code complexity this would
>> > > > introduce.
>> > > >
>> > > > Thanks,
>> > > > Micah
>> > > >
>> > > > On Tue, Mar 10, 2020 at 6:42 AM Fan Liya <li...@gmail.com>
>> wrote:
>> > > >
>> > > > > Hi Wes,
>> > > > >
>> > > > > Thanks a lot for your quick reply.
>> > > > > I think what you mentioned is almost exactly what we want to do in
>> > > > Java.The
>> > > > > concept is not important.
>> > > > >
>> > > > > Maybe there are only some minor differences:
>> > > > > 1. In C++, the null_count is mutable, while for Java, once a
>> vector
>> > is
>> > > > > constructed as non-nullable, its null count can only be 0.
>> > > > > 2. In C++, a non-nullable array's validity buffer is null, while
>> in
>> > > Java,
>> > > > > the buffer is an empty buffer, and cannot be changed.
>> > > > >
>> > > > > Best,
>> > > > > Liya Fan
>> > > > >
>> > > > > On Tue, Mar 10, 2020 at 9:26 PM Wes McKinney <wesmckinn@gmail.com
>> >
>> > > > wrote:
>> > > > >
>> > > > > > hi Liya,
>> > > > > >
>> > > > > > In C++ we elect certain faster code paths when the null count
>> is 0
>> > or
>> > > > > > computed to be zero. When the null count is 0, we do not
>> allocate a
>> > > > > > validity bitmap. And there is a "nullable" metadata-only flag at
>> > the
>> > > > > > Field level. Could the same kinds of optimizations be
>> implemented
>> > in
>> > > > > > Java without introducing a "nullable" concept?
>> > > > > >
>> > > > > > - Wes
>> > > > > >
>> > > > > > On Tue, Mar 10, 2020 at 8:13 AM Fan Liya <li...@gmail.com>
>> > > wrote:
>> > > > > > >
>> > > > > > > Dear all,
>> > > > > > >
>> > > > > > > A non-nullable vector is one that is guaranteed to contain no
>> > > nulls.
>> > > > We
>> > > > > > > want to support non-nullable vectors in Java.
>> > > > > > >
>> > > > > > > *Motivations:*
>> > > > > > > 1. It is widely used in practice. For example, in a database
>> > > engine,
>> > > > a
>> > > > > > > column can be declared as not null, so it cannot contain null
>> > > values.
>> > > > > > > 2.Non-nullable vectors has significant performance advantages
>> > > > compared
>> > > > > > with
>> > > > > > > their nullable conterparts, such as:
>> > > > > > >   1) the memory space of the validity buffer can be saved.
>> > > > > > >   2) manipulation of the validity buffer can be bypassed
>> > > > > > >   3) some if-else branches can be replaced by sequential
>> > > instructions
>> > > > > (by
>> > > > > > > the JIT compiler), leading to high throughput for the CPU
>> > pipeline.
>> > > > > > >
>> > > > > > > *Potential Cost:*
>> > > > > > > For nullable vectors, there can be extra checks against the
>> > > > > nullablility
>> > > > > > > flag. So we must change the code in a way that minimizes the
>> > cost.
>> > > > > > >
>> > > > > > > *Proposed Changes:*
>> > > > > > > 1. There is no need to create new vector classes. We add a
>> final
>> > > > > boolean
>> > > > > > to
>> > > > > > > the vector base classes as the nullability flag. The value of
>> the
>> > > > flag
>> > > > > > can
>> > > > > > > be obtained from the field when creating the vector.
>> > > > > > > 2. Add a method "boolean isNullable()" to the root interface
>> > > > > ValueVector.
>> > > > > > > 3. If a vector is non-nullable, its validity buffer should be
>> an
>> > > > empty
>> > > > > > > buffer (not null, so much of the existing logic can be left
>> > > > unchanged).
>> > > > > > > 4. For operations involving validity buffers (e.g. isNull,
>> get,
>> > > set),
>> > > > > we
>> > > > > > > use the nullability flag to bypass manipulations to the
>> validity
>> > > > > buffer.
>> > > > > > >
>> > > > > > > Therefore, it should be possible to support the feature with
>> > small
>> > > > code
>> > > > > > > changes.
>> > > > > > >
>> > > > > > > BTW, please note that similar behaviors have already been
>> > supported
>> > > > in
>> > > > > > C++.
>> > > > > > >
>> > > > > > > Would you please give your valueable feedback?
>> > > > > > >
>> > > > > > > Best,
>> > > > > > > Liya Fan
>> > > > > >
>> > > > >
>> > > >
>> > >
>> >
>>
>

Re: [DISCUSS][Java] Support non-nullable vectors

Posted by Fan Liya <li...@gmail.com>.
Hi Jacques,

Thanks a lot for your valuable comments.

I agree with you that collapsing nullable and non-nullable implementations
is a good idea, and it does not contradict with the idea of introducing a
fast code path, if it does not introduce much cost or code complexity.

The idea of word level checking is interesting.

As you suggested, we will do more investigations and reconsider the problem
from a broader perspective.

Best,
Liya Fan



On Thu, Mar 12, 2020 at 9:27 AM Jacques Nadeau <ja...@apache.org> wrote:

> Generally Ive found that this isnt an important optimization in the use
> cases we see. Memory overhead, especially with our Java shared allocation
> scheme is nominal. Optimizing null checks at the word level usually is much
> more impactful since non null and null runs are much more common on a
> shorter window common than they seeing those declared.
>
> In other words, I'd suggest you look at your problem with a broader
> perspective and see whether you're actually focused on optimizing the most
> important dimension.
>
> As an aside, the original Arrow Java code actually treated these concepts
> more distinctly and we consciously made a decision to collapse them to
> simplify real world use.
>
> I do think it makes to add a dirty read interface if you want. This would
> allow consumers of the interface to behave efficiently if they wanted to.
>
> One last note, efficient evaluation and processing should generally always
> work at the validity word level. Adding an extra if check at the word
> versus extra complexity of having an early out per batch seems like a
> pretty small life in the grand scheme of processing.
>
> On Wed, Mar 11, 2020, 9:15 AM Brian Hulette <hu...@gmail.com> wrote:
>
> > > And there is a "nullable" metadata-only flag at the
> > > Field level. Could the same kinds of optimizations be implemented in
> > > Java without introducing a "nullable" concept?
> >
> > Note Liya Fan did suggest pulling the nullable flag from the Field when
> the
> > vector is created in item (1) of the proposed changes.
> >
> > Brian
> >
> > On Wed, Mar 11, 2020 at 5:54 AM Fan Liya <li...@gmail.com> wrote:
> >
> > > Hi Micah,
> > >
> > > Thanks a lot for your valuable comments. Please see my comments inline.
> > >
> > > > I'm a little concerned that this will change assumptions for at least
> > > some
> > > > of the clients using the library (some might always rely on the
> > validity
> > > > buffer being present).
> > >
> > > I can understand your concern and I am also concerned.
> > > IMO, the client should not depend on this assumption, as the
> > specification
> > > says "Arrays having a 0 null count may choose to not allocate the
> > validity
> > > bitmap." [1]
> > > That being said, I think it would be safe to provide a global flag to
> > > switch on/off the feature (as you suggested).
> > >
> > > > I think this is a good feature to have for the reasons you mentioned.
> > It
> > > > seems like there would need to be some sort of configuration bit to
> set
> > > for
> > > > this behavior.
> > >
> > > Good suggestion. We should be able to switch on and off the feature
> with
> > a
> > > single global flag.
> > >
> > > > But, I'd be worried about code complexity this would
> > > > introduce.
> > >
> > > I agree with you that code complexity is an important factor to
> consider.
> > > IMO, our proposal should not involve too much code change, or increase
> > code
> > > complexity too much.
> > > To prove this, maybe we need to show some small experimental code
> change.
> > >
> > > Best,
> > > Liya Fan
> > >
> > > [1] https://arrow.apache.org/docs/format/Columnar.html#logical-types
> > >
> > > On Wed, Mar 11, 2020 at 1:53 PM Micah Kornfield <emkornfield@gmail.com
> >
> > > wrote:
> > >
> > > > Hi Liya Fan,
> > > > I'm a little concerned that this will change assumptions for at least
> > > some
> > > > of the clients using the library (some might always rely on the
> > validity
> > > > buffer being present).
> > > >
> > > > I think this is a good feature to have for the reasons you mentioned.
> > It
> > > > seems like there would need to be some sort of configuration bit to
> set
> > > for
> > > > this behavior. But, I'd be worried about code complexity this would
> > > > introduce.
> > > >
> > > > Thanks,
> > > > Micah
> > > >
> > > > On Tue, Mar 10, 2020 at 6:42 AM Fan Liya <li...@gmail.com>
> wrote:
> > > >
> > > > > Hi Wes,
> > > > >
> > > > > Thanks a lot for your quick reply.
> > > > > I think what you mentioned is almost exactly what we want to do in
> > > > Java.The
> > > > > concept is not important.
> > > > >
> > > > > Maybe there are only some minor differences:
> > > > > 1. In C++, the null_count is mutable, while for Java, once a vector
> > is
> > > > > constructed as non-nullable, its null count can only be 0.
> > > > > 2. In C++, a non-nullable array's validity buffer is null, while in
> > > Java,
> > > > > the buffer is an empty buffer, and cannot be changed.
> > > > >
> > > > > Best,
> > > > > Liya Fan
> > > > >
> > > > > On Tue, Mar 10, 2020 at 9:26 PM Wes McKinney <we...@gmail.com>
> > > > wrote:
> > > > >
> > > > > > hi Liya,
> > > > > >
> > > > > > In C++ we elect certain faster code paths when the null count is
> 0
> > or
> > > > > > computed to be zero. When the null count is 0, we do not
> allocate a
> > > > > > validity bitmap. And there is a "nullable" metadata-only flag at
> > the
> > > > > > Field level. Could the same kinds of optimizations be implemented
> > in
> > > > > > Java without introducing a "nullable" concept?
> > > > > >
> > > > > > - Wes
> > > > > >
> > > > > > On Tue, Mar 10, 2020 at 8:13 AM Fan Liya <li...@gmail.com>
> > > wrote:
> > > > > > >
> > > > > > > Dear all,
> > > > > > >
> > > > > > > A non-nullable vector is one that is guaranteed to contain no
> > > nulls.
> > > > We
> > > > > > > want to support non-nullable vectors in Java.
> > > > > > >
> > > > > > > *Motivations:*
> > > > > > > 1. It is widely used in practice. For example, in a database
> > > engine,
> > > > a
> > > > > > > column can be declared as not null, so it cannot contain null
> > > values.
> > > > > > > 2.Non-nullable vectors has significant performance advantages
> > > > compared
> > > > > > with
> > > > > > > their nullable conterparts, such as:
> > > > > > >   1) the memory space of the validity buffer can be saved.
> > > > > > >   2) manipulation of the validity buffer can be bypassed
> > > > > > >   3) some if-else branches can be replaced by sequential
> > > instructions
> > > > > (by
> > > > > > > the JIT compiler), leading to high throughput for the CPU
> > pipeline.
> > > > > > >
> > > > > > > *Potential Cost:*
> > > > > > > For nullable vectors, there can be extra checks against the
> > > > > nullablility
> > > > > > > flag. So we must change the code in a way that minimizes the
> > cost.
> > > > > > >
> > > > > > > *Proposed Changes:*
> > > > > > > 1. There is no need to create new vector classes. We add a
> final
> > > > > boolean
> > > > > > to
> > > > > > > the vector base classes as the nullability flag. The value of
> the
> > > > flag
> > > > > > can
> > > > > > > be obtained from the field when creating the vector.
> > > > > > > 2. Add a method "boolean isNullable()" to the root interface
> > > > > ValueVector.
> > > > > > > 3. If a vector is non-nullable, its validity buffer should be
> an
> > > > empty
> > > > > > > buffer (not null, so much of the existing logic can be left
> > > > unchanged).
> > > > > > > 4. For operations involving validity buffers (e.g. isNull, get,
> > > set),
> > > > > we
> > > > > > > use the nullability flag to bypass manipulations to the
> validity
> > > > > buffer.
> > > > > > >
> > > > > > > Therefore, it should be possible to support the feature with
> > small
> > > > code
> > > > > > > changes.
> > > > > > >
> > > > > > > BTW, please note that similar behaviors have already been
> > supported
> > > > in
> > > > > > C++.
> > > > > > >
> > > > > > > Would you please give your valueable feedback?
> > > > > > >
> > > > > > > Best,
> > > > > > > Liya Fan
> > > > > >
> > > > >
> > > >
> > >
> >
>

Re: [DISCUSS][Java] Support non-nullable vectors

Posted by Jacques Nadeau <ja...@apache.org>.
Generally Ive found that this isnt an important optimization in the use
cases we see. Memory overhead, especially with our Java shared allocation
scheme is nominal. Optimizing null checks at the word level usually is much
more impactful since non null and null runs are much more common on a
shorter window common than they seeing those declared.

In other words, I'd suggest you look at your problem with a broader
perspective and see whether you're actually focused on optimizing the most
important dimension.

As an aside, the original Arrow Java code actually treated these concepts
more distinctly and we consciously made a decision to collapse them to
simplify real world use.

I do think it makes to add a dirty read interface if you want. This would
allow consumers of the interface to behave efficiently if they wanted to.

One last note, efficient evaluation and processing should generally always
work at the validity word level. Adding an extra if check at the word
versus extra complexity of having an early out per batch seems like a
pretty small life in the grand scheme of processing.

On Wed, Mar 11, 2020, 9:15 AM Brian Hulette <hu...@gmail.com> wrote:

> > And there is a "nullable" metadata-only flag at the
> > Field level. Could the same kinds of optimizations be implemented in
> > Java without introducing a "nullable" concept?
>
> Note Liya Fan did suggest pulling the nullable flag from the Field when the
> vector is created in item (1) of the proposed changes.
>
> Brian
>
> On Wed, Mar 11, 2020 at 5:54 AM Fan Liya <li...@gmail.com> wrote:
>
> > Hi Micah,
> >
> > Thanks a lot for your valuable comments. Please see my comments inline.
> >
> > > I'm a little concerned that this will change assumptions for at least
> > some
> > > of the clients using the library (some might always rely on the
> validity
> > > buffer being present).
> >
> > I can understand your concern and I am also concerned.
> > IMO, the client should not depend on this assumption, as the
> specification
> > says "Arrays having a 0 null count may choose to not allocate the
> validity
> > bitmap." [1]
> > That being said, I think it would be safe to provide a global flag to
> > switch on/off the feature (as you suggested).
> >
> > > I think this is a good feature to have for the reasons you mentioned.
> It
> > > seems like there would need to be some sort of configuration bit to set
> > for
> > > this behavior.
> >
> > Good suggestion. We should be able to switch on and off the feature with
> a
> > single global flag.
> >
> > > But, I'd be worried about code complexity this would
> > > introduce.
> >
> > I agree with you that code complexity is an important factor to consider.
> > IMO, our proposal should not involve too much code change, or increase
> code
> > complexity too much.
> > To prove this, maybe we need to show some small experimental code change.
> >
> > Best,
> > Liya Fan
> >
> > [1] https://arrow.apache.org/docs/format/Columnar.html#logical-types
> >
> > On Wed, Mar 11, 2020 at 1:53 PM Micah Kornfield <em...@gmail.com>
> > wrote:
> >
> > > Hi Liya Fan,
> > > I'm a little concerned that this will change assumptions for at least
> > some
> > > of the clients using the library (some might always rely on the
> validity
> > > buffer being present).
> > >
> > > I think this is a good feature to have for the reasons you mentioned.
> It
> > > seems like there would need to be some sort of configuration bit to set
> > for
> > > this behavior. But, I'd be worried about code complexity this would
> > > introduce.
> > >
> > > Thanks,
> > > Micah
> > >
> > > On Tue, Mar 10, 2020 at 6:42 AM Fan Liya <li...@gmail.com> wrote:
> > >
> > > > Hi Wes,
> > > >
> > > > Thanks a lot for your quick reply.
> > > > I think what you mentioned is almost exactly what we want to do in
> > > Java.The
> > > > concept is not important.
> > > >
> > > > Maybe there are only some minor differences:
> > > > 1. In C++, the null_count is mutable, while for Java, once a vector
> is
> > > > constructed as non-nullable, its null count can only be 0.
> > > > 2. In C++, a non-nullable array's validity buffer is null, while in
> > Java,
> > > > the buffer is an empty buffer, and cannot be changed.
> > > >
> > > > Best,
> > > > Liya Fan
> > > >
> > > > On Tue, Mar 10, 2020 at 9:26 PM Wes McKinney <we...@gmail.com>
> > > wrote:
> > > >
> > > > > hi Liya,
> > > > >
> > > > > In C++ we elect certain faster code paths when the null count is 0
> or
> > > > > computed to be zero. When the null count is 0, we do not allocate a
> > > > > validity bitmap. And there is a "nullable" metadata-only flag at
> the
> > > > > Field level. Could the same kinds of optimizations be implemented
> in
> > > > > Java without introducing a "nullable" concept?
> > > > >
> > > > > - Wes
> > > > >
> > > > > On Tue, Mar 10, 2020 at 8:13 AM Fan Liya <li...@gmail.com>
> > wrote:
> > > > > >
> > > > > > Dear all,
> > > > > >
> > > > > > A non-nullable vector is one that is guaranteed to contain no
> > nulls.
> > > We
> > > > > > want to support non-nullable vectors in Java.
> > > > > >
> > > > > > *Motivations:*
> > > > > > 1. It is widely used in practice. For example, in a database
> > engine,
> > > a
> > > > > > column can be declared as not null, so it cannot contain null
> > values.
> > > > > > 2.Non-nullable vectors has significant performance advantages
> > > compared
> > > > > with
> > > > > > their nullable conterparts, such as:
> > > > > >   1) the memory space of the validity buffer can be saved.
> > > > > >   2) manipulation of the validity buffer can be bypassed
> > > > > >   3) some if-else branches can be replaced by sequential
> > instructions
> > > > (by
> > > > > > the JIT compiler), leading to high throughput for the CPU
> pipeline.
> > > > > >
> > > > > > *Potential Cost:*
> > > > > > For nullable vectors, there can be extra checks against the
> > > > nullablility
> > > > > > flag. So we must change the code in a way that minimizes the
> cost.
> > > > > >
> > > > > > *Proposed Changes:*
> > > > > > 1. There is no need to create new vector classes. We add a final
> > > > boolean
> > > > > to
> > > > > > the vector base classes as the nullability flag. The value of the
> > > flag
> > > > > can
> > > > > > be obtained from the field when creating the vector.
> > > > > > 2. Add a method "boolean isNullable()" to the root interface
> > > > ValueVector.
> > > > > > 3. If a vector is non-nullable, its validity buffer should be an
> > > empty
> > > > > > buffer (not null, so much of the existing logic can be left
> > > unchanged).
> > > > > > 4. For operations involving validity buffers (e.g. isNull, get,
> > set),
> > > > we
> > > > > > use the nullability flag to bypass manipulations to the validity
> > > > buffer.
> > > > > >
> > > > > > Therefore, it should be possible to support the feature with
> small
> > > code
> > > > > > changes.
> > > > > >
> > > > > > BTW, please note that similar behaviors have already been
> supported
> > > in
> > > > > C++.
> > > > > >
> > > > > > Would you please give your valueable feedback?
> > > > > >
> > > > > > Best,
> > > > > > Liya Fan
> > > > >
> > > >
> > >
> >
>

Re: [DISCUSS][Java] Support non-nullable vectors

Posted by Brian Hulette <hu...@gmail.com>.
> And there is a "nullable" metadata-only flag at the
> Field level. Could the same kinds of optimizations be implemented in
> Java without introducing a "nullable" concept?

Note Liya Fan did suggest pulling the nullable flag from the Field when the
vector is created in item (1) of the proposed changes.

Brian

On Wed, Mar 11, 2020 at 5:54 AM Fan Liya <li...@gmail.com> wrote:

> Hi Micah,
>
> Thanks a lot for your valuable comments. Please see my comments inline.
>
> > I'm a little concerned that this will change assumptions for at least
> some
> > of the clients using the library (some might always rely on the validity
> > buffer being present).
>
> I can understand your concern and I am also concerned.
> IMO, the client should not depend on this assumption, as the specification
> says "Arrays having a 0 null count may choose to not allocate the validity
> bitmap." [1]
> That being said, I think it would be safe to provide a global flag to
> switch on/off the feature (as you suggested).
>
> > I think this is a good feature to have for the reasons you mentioned. It
> > seems like there would need to be some sort of configuration bit to set
> for
> > this behavior.
>
> Good suggestion. We should be able to switch on and off the feature with a
> single global flag.
>
> > But, I'd be worried about code complexity this would
> > introduce.
>
> I agree with you that code complexity is an important factor to consider.
> IMO, our proposal should not involve too much code change, or increase code
> complexity too much.
> To prove this, maybe we need to show some small experimental code change.
>
> Best,
> Liya Fan
>
> [1] https://arrow.apache.org/docs/format/Columnar.html#logical-types
>
> On Wed, Mar 11, 2020 at 1:53 PM Micah Kornfield <em...@gmail.com>
> wrote:
>
> > Hi Liya Fan,
> > I'm a little concerned that this will change assumptions for at least
> some
> > of the clients using the library (some might always rely on the validity
> > buffer being present).
> >
> > I think this is a good feature to have for the reasons you mentioned. It
> > seems like there would need to be some sort of configuration bit to set
> for
> > this behavior. But, I'd be worried about code complexity this would
> > introduce.
> >
> > Thanks,
> > Micah
> >
> > On Tue, Mar 10, 2020 at 6:42 AM Fan Liya <li...@gmail.com> wrote:
> >
> > > Hi Wes,
> > >
> > > Thanks a lot for your quick reply.
> > > I think what you mentioned is almost exactly what we want to do in
> > Java.The
> > > concept is not important.
> > >
> > > Maybe there are only some minor differences:
> > > 1. In C++, the null_count is mutable, while for Java, once a vector is
> > > constructed as non-nullable, its null count can only be 0.
> > > 2. In C++, a non-nullable array's validity buffer is null, while in
> Java,
> > > the buffer is an empty buffer, and cannot be changed.
> > >
> > > Best,
> > > Liya Fan
> > >
> > > On Tue, Mar 10, 2020 at 9:26 PM Wes McKinney <we...@gmail.com>
> > wrote:
> > >
> > > > hi Liya,
> > > >
> > > > In C++ we elect certain faster code paths when the null count is 0 or
> > > > computed to be zero. When the null count is 0, we do not allocate a
> > > > validity bitmap. And there is a "nullable" metadata-only flag at the
> > > > Field level. Could the same kinds of optimizations be implemented in
> > > > Java without introducing a "nullable" concept?
> > > >
> > > > - Wes
> > > >
> > > > On Tue, Mar 10, 2020 at 8:13 AM Fan Liya <li...@gmail.com>
> wrote:
> > > > >
> > > > > Dear all,
> > > > >
> > > > > A non-nullable vector is one that is guaranteed to contain no
> nulls.
> > We
> > > > > want to support non-nullable vectors in Java.
> > > > >
> > > > > *Motivations:*
> > > > > 1. It is widely used in practice. For example, in a database
> engine,
> > a
> > > > > column can be declared as not null, so it cannot contain null
> values.
> > > > > 2.Non-nullable vectors has significant performance advantages
> > compared
> > > > with
> > > > > their nullable conterparts, such as:
> > > > >   1) the memory space of the validity buffer can be saved.
> > > > >   2) manipulation of the validity buffer can be bypassed
> > > > >   3) some if-else branches can be replaced by sequential
> instructions
> > > (by
> > > > > the JIT compiler), leading to high throughput for the CPU pipeline.
> > > > >
> > > > > *Potential Cost:*
> > > > > For nullable vectors, there can be extra checks against the
> > > nullablility
> > > > > flag. So we must change the code in a way that minimizes the cost.
> > > > >
> > > > > *Proposed Changes:*
> > > > > 1. There is no need to create new vector classes. We add a final
> > > boolean
> > > > to
> > > > > the vector base classes as the nullability flag. The value of the
> > flag
> > > > can
> > > > > be obtained from the field when creating the vector.
> > > > > 2. Add a method "boolean isNullable()" to the root interface
> > > ValueVector.
> > > > > 3. If a vector is non-nullable, its validity buffer should be an
> > empty
> > > > > buffer (not null, so much of the existing logic can be left
> > unchanged).
> > > > > 4. For operations involving validity buffers (e.g. isNull, get,
> set),
> > > we
> > > > > use the nullability flag to bypass manipulations to the validity
> > > buffer.
> > > > >
> > > > > Therefore, it should be possible to support the feature with small
> > code
> > > > > changes.
> > > > >
> > > > > BTW, please note that similar behaviors have already been supported
> > in
> > > > C++.
> > > > >
> > > > > Would you please give your valueable feedback?
> > > > >
> > > > > Best,
> > > > > Liya Fan
> > > >
> > >
> >
>

Re: [DISCUSS][Java] Support non-nullable vectors

Posted by Fan Liya <li...@gmail.com>.
Hi Micah,

Thanks a lot for your valuable comments. Please see my comments inline.

> I'm a little concerned that this will change assumptions for at least some
> of the clients using the library (some might always rely on the validity
> buffer being present).

I can understand your concern and I am also concerned.
IMO, the client should not depend on this assumption, as the specification
says "Arrays having a 0 null count may choose to not allocate the validity
bitmap." [1]
That being said, I think it would be safe to provide a global flag to
switch on/off the feature (as you suggested).

> I think this is a good feature to have for the reasons you mentioned. It
> seems like there would need to be some sort of configuration bit to set
for
> this behavior.

Good suggestion. We should be able to switch on and off the feature with a
single global flag.

> But, I'd be worried about code complexity this would
> introduce.

I agree with you that code complexity is an important factor to consider.
IMO, our proposal should not involve too much code change, or increase code
complexity too much.
To prove this, maybe we need to show some small experimental code change.

Best,
Liya Fan

[1] https://arrow.apache.org/docs/format/Columnar.html#logical-types

On Wed, Mar 11, 2020 at 1:53 PM Micah Kornfield <em...@gmail.com>
wrote:

> Hi Liya Fan,
> I'm a little concerned that this will change assumptions for at least some
> of the clients using the library (some might always rely on the validity
> buffer being present).
>
> I think this is a good feature to have for the reasons you mentioned. It
> seems like there would need to be some sort of configuration bit to set for
> this behavior. But, I'd be worried about code complexity this would
> introduce.
>
> Thanks,
> Micah
>
> On Tue, Mar 10, 2020 at 6:42 AM Fan Liya <li...@gmail.com> wrote:
>
> > Hi Wes,
> >
> > Thanks a lot for your quick reply.
> > I think what you mentioned is almost exactly what we want to do in
> Java.The
> > concept is not important.
> >
> > Maybe there are only some minor differences:
> > 1. In C++, the null_count is mutable, while for Java, once a vector is
> > constructed as non-nullable, its null count can only be 0.
> > 2. In C++, a non-nullable array's validity buffer is null, while in Java,
> > the buffer is an empty buffer, and cannot be changed.
> >
> > Best,
> > Liya Fan
> >
> > On Tue, Mar 10, 2020 at 9:26 PM Wes McKinney <we...@gmail.com>
> wrote:
> >
> > > hi Liya,
> > >
> > > In C++ we elect certain faster code paths when the null count is 0 or
> > > computed to be zero. When the null count is 0, we do not allocate a
> > > validity bitmap. And there is a "nullable" metadata-only flag at the
> > > Field level. Could the same kinds of optimizations be implemented in
> > > Java without introducing a "nullable" concept?
> > >
> > > - Wes
> > >
> > > On Tue, Mar 10, 2020 at 8:13 AM Fan Liya <li...@gmail.com> wrote:
> > > >
> > > > Dear all,
> > > >
> > > > A non-nullable vector is one that is guaranteed to contain no nulls.
> We
> > > > want to support non-nullable vectors in Java.
> > > >
> > > > *Motivations:*
> > > > 1. It is widely used in practice. For example, in a database engine,
> a
> > > > column can be declared as not null, so it cannot contain null values.
> > > > 2.Non-nullable vectors has significant performance advantages
> compared
> > > with
> > > > their nullable conterparts, such as:
> > > >   1) the memory space of the validity buffer can be saved.
> > > >   2) manipulation of the validity buffer can be bypassed
> > > >   3) some if-else branches can be replaced by sequential instructions
> > (by
> > > > the JIT compiler), leading to high throughput for the CPU pipeline.
> > > >
> > > > *Potential Cost:*
> > > > For nullable vectors, there can be extra checks against the
> > nullablility
> > > > flag. So we must change the code in a way that minimizes the cost.
> > > >
> > > > *Proposed Changes:*
> > > > 1. There is no need to create new vector classes. We add a final
> > boolean
> > > to
> > > > the vector base classes as the nullability flag. The value of the
> flag
> > > can
> > > > be obtained from the field when creating the vector.
> > > > 2. Add a method "boolean isNullable()" to the root interface
> > ValueVector.
> > > > 3. If a vector is non-nullable, its validity buffer should be an
> empty
> > > > buffer (not null, so much of the existing logic can be left
> unchanged).
> > > > 4. For operations involving validity buffers (e.g. isNull, get, set),
> > we
> > > > use the nullability flag to bypass manipulations to the validity
> > buffer.
> > > >
> > > > Therefore, it should be possible to support the feature with small
> code
> > > > changes.
> > > >
> > > > BTW, please note that similar behaviors have already been supported
> in
> > > C++.
> > > >
> > > > Would you please give your valueable feedback?
> > > >
> > > > Best,
> > > > Liya Fan
> > >
> >
>

Re: [DISCUSS][Java] Support non-nullable vectors

Posted by Micah Kornfield <em...@gmail.com>.
Hi Liya Fan,
I'm a little concerned that this will change assumptions for at least some
of the clients using the library (some might always rely on the validity
buffer being present).

I think this is a good feature to have for the reasons you mentioned. It
seems like there would need to be some sort of configuration bit to set for
this behavior. But, I'd be worried about code complexity this would
introduce.

Thanks,
Micah

On Tue, Mar 10, 2020 at 6:42 AM Fan Liya <li...@gmail.com> wrote:

> Hi Wes,
>
> Thanks a lot for your quick reply.
> I think what you mentioned is almost exactly what we want to do in Java.The
> concept is not important.
>
> Maybe there are only some minor differences:
> 1. In C++, the null_count is mutable, while for Java, once a vector is
> constructed as non-nullable, its null count can only be 0.
> 2. In C++, a non-nullable array's validity buffer is null, while in Java,
> the buffer is an empty buffer, and cannot be changed.
>
> Best,
> Liya Fan
>
> On Tue, Mar 10, 2020 at 9:26 PM Wes McKinney <we...@gmail.com> wrote:
>
> > hi Liya,
> >
> > In C++ we elect certain faster code paths when the null count is 0 or
> > computed to be zero. When the null count is 0, we do not allocate a
> > validity bitmap. And there is a "nullable" metadata-only flag at the
> > Field level. Could the same kinds of optimizations be implemented in
> > Java without introducing a "nullable" concept?
> >
> > - Wes
> >
> > On Tue, Mar 10, 2020 at 8:13 AM Fan Liya <li...@gmail.com> wrote:
> > >
> > > Dear all,
> > >
> > > A non-nullable vector is one that is guaranteed to contain no nulls. We
> > > want to support non-nullable vectors in Java.
> > >
> > > *Motivations:*
> > > 1. It is widely used in practice. For example, in a database engine, a
> > > column can be declared as not null, so it cannot contain null values.
> > > 2.Non-nullable vectors has significant performance advantages compared
> > with
> > > their nullable conterparts, such as:
> > >   1) the memory space of the validity buffer can be saved.
> > >   2) manipulation of the validity buffer can be bypassed
> > >   3) some if-else branches can be replaced by sequential instructions
> (by
> > > the JIT compiler), leading to high throughput for the CPU pipeline.
> > >
> > > *Potential Cost:*
> > > For nullable vectors, there can be extra checks against the
> nullablility
> > > flag. So we must change the code in a way that minimizes the cost.
> > >
> > > *Proposed Changes:*
> > > 1. There is no need to create new vector classes. We add a final
> boolean
> > to
> > > the vector base classes as the nullability flag. The value of the flag
> > can
> > > be obtained from the field when creating the vector.
> > > 2. Add a method "boolean isNullable()" to the root interface
> ValueVector.
> > > 3. If a vector is non-nullable, its validity buffer should be an empty
> > > buffer (not null, so much of the existing logic can be left unchanged).
> > > 4. For operations involving validity buffers (e.g. isNull, get, set),
> we
> > > use the nullability flag to bypass manipulations to the validity
> buffer.
> > >
> > > Therefore, it should be possible to support the feature with small code
> > > changes.
> > >
> > > BTW, please note that similar behaviors have already been supported in
> > C++.
> > >
> > > Would you please give your valueable feedback?
> > >
> > > Best,
> > > Liya Fan
> >
>

Re: [DISCUSS][Java] Support non-nullable vectors

Posted by Fan Liya <li...@gmail.com>.
Hi Wes,

Thanks a lot for your quick reply.
I think what you mentioned is almost exactly what we want to do in Java.The
concept is not important.

Maybe there are only some minor differences:
1. In C++, the null_count is mutable, while for Java, once a vector is
constructed as non-nullable, its null count can only be 0.
2. In C++, a non-nullable array's validity buffer is null, while in Java,
the buffer is an empty buffer, and cannot be changed.

Best,
Liya Fan

On Tue, Mar 10, 2020 at 9:26 PM Wes McKinney <we...@gmail.com> wrote:

> hi Liya,
>
> In C++ we elect certain faster code paths when the null count is 0 or
> computed to be zero. When the null count is 0, we do not allocate a
> validity bitmap. And there is a "nullable" metadata-only flag at the
> Field level. Could the same kinds of optimizations be implemented in
> Java without introducing a "nullable" concept?
>
> - Wes
>
> On Tue, Mar 10, 2020 at 8:13 AM Fan Liya <li...@gmail.com> wrote:
> >
> > Dear all,
> >
> > A non-nullable vector is one that is guaranteed to contain no nulls. We
> > want to support non-nullable vectors in Java.
> >
> > *Motivations:*
> > 1. It is widely used in practice. For example, in a database engine, a
> > column can be declared as not null, so it cannot contain null values.
> > 2.Non-nullable vectors has significant performance advantages compared
> with
> > their nullable conterparts, such as:
> >   1) the memory space of the validity buffer can be saved.
> >   2) manipulation of the validity buffer can be bypassed
> >   3) some if-else branches can be replaced by sequential instructions (by
> > the JIT compiler), leading to high throughput for the CPU pipeline.
> >
> > *Potential Cost:*
> > For nullable vectors, there can be extra checks against the nullablility
> > flag. So we must change the code in a way that minimizes the cost.
> >
> > *Proposed Changes:*
> > 1. There is no need to create new vector classes. We add a final boolean
> to
> > the vector base classes as the nullability flag. The value of the flag
> can
> > be obtained from the field when creating the vector.
> > 2. Add a method "boolean isNullable()" to the root interface ValueVector.
> > 3. If a vector is non-nullable, its validity buffer should be an empty
> > buffer (not null, so much of the existing logic can be left unchanged).
> > 4. For operations involving validity buffers (e.g. isNull, get, set), we
> > use the nullability flag to bypass manipulations to the validity buffer.
> >
> > Therefore, it should be possible to support the feature with small code
> > changes.
> >
> > BTW, please note that similar behaviors have already been supported in
> C++.
> >
> > Would you please give your valueable feedback?
> >
> > Best,
> > Liya Fan
>

Re: [DISCUSS][Java] Support non-nullable vectors

Posted by Wes McKinney <we...@gmail.com>.
hi Liya,

In C++ we elect certain faster code paths when the null count is 0 or
computed to be zero. When the null count is 0, we do not allocate a
validity bitmap. And there is a "nullable" metadata-only flag at the
Field level. Could the same kinds of optimizations be implemented in
Java without introducing a "nullable" concept?

- Wes

On Tue, Mar 10, 2020 at 8:13 AM Fan Liya <li...@gmail.com> wrote:
>
> Dear all,
>
> A non-nullable vector is one that is guaranteed to contain no nulls. We
> want to support non-nullable vectors in Java.
>
> *Motivations:*
> 1. It is widely used in practice. For example, in a database engine, a
> column can be declared as not null, so it cannot contain null values.
> 2.Non-nullable vectors has significant performance advantages compared with
> their nullable conterparts, such as:
>   1) the memory space of the validity buffer can be saved.
>   2) manipulation of the validity buffer can be bypassed
>   3) some if-else branches can be replaced by sequential instructions (by
> the JIT compiler), leading to high throughput for the CPU pipeline.
>
> *Potential Cost:*
> For nullable vectors, there can be extra checks against the nullablility
> flag. So we must change the code in a way that minimizes the cost.
>
> *Proposed Changes:*
> 1. There is no need to create new vector classes. We add a final boolean to
> the vector base classes as the nullability flag. The value of the flag can
> be obtained from the field when creating the vector.
> 2. Add a method "boolean isNullable()" to the root interface ValueVector.
> 3. If a vector is non-nullable, its validity buffer should be an empty
> buffer (not null, so much of the existing logic can be left unchanged).
> 4. For operations involving validity buffers (e.g. isNull, get, set), we
> use the nullability flag to bypass manipulations to the validity buffer.
>
> Therefore, it should be possible to support the feature with small code
> changes.
>
> BTW, please note that similar behaviors have already been supported in C++.
>
> Would you please give your valueable feedback?
>
> Best,
> Liya Fan