Posted to dev@arrow.apache.org by azim afroozeh <af...@gmail.com> on 2019/11/08 09:36:40 UTC

[Java] Question About Vector Allocation

Hi everyone,

I have a question about the Java implementation of Apache Arrow. Should we
always call setValueCount after creating a vector with allocateNew()?

I can see that some tests call setValueCount immediately after
allocateNew, for example here:
https://github.com/apache/arrow/blob/master/java/vector/src/test/java/org/apache/arrow/vector/TestValueVector.java#L285
while other tests do not:
https://github.com/apache/arrow/blob/master/java/vector/src/test/java/org/apache/arrow/vector/TestValueVector.java#L792

To illustrate the problem, suppose I change the isSet(int index) function as
follows:

public int isSet(int index) {
  // Proposed early exit: an empty vector has no set values.
  if (valueCount == 0) {
    return 0;
  }
  final int byteIndex = index >> 3;  // index / 8
  final byte b = validityBuffer.getByte(byteIndex);
  final int bitIndex = index & 7;    // index % 8
  return (b >> bitIndex) & 0x01;
}

Many tests then fail, although logically they should not: if valueCount is
0, isSet should return 0 for every index. The problem comes from the
allocateNew method, which does not initialize the valueCount variable.
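
The bit arithmetic in the isSet body above can be tried in isolation. The following is only a sketch: a plain byte array stands in for the validity buffer, and the class name ValidityBitmapSketch and the markSet helper are made up for illustration, not part of Arrow. Note that the lookup itself never consults a value count.

```java
// Toy stand-in for a validity buffer; not Arrow's actual class.
final class ValidityBitmapSketch {
    private final byte[] bits;

    ValidityBitmapSketch(int capacity) {
        // One bit per slot, rounded up to whole bytes; Java zero-fills the array.
        this.bits = new byte[(capacity + 7) / 8];
    }

    // Hypothetical helper: flips the validity bit for one slot to 1.
    void markSet(int index) {
        bits[index >> 3] |= (byte) (1 << (index & 7));
    }

    // Same lookup as in the snippet above, minus the valueCount check.
    int isSet(int index) {
        final int byteIndex = index >> 3; // index / 8
        final byte b = bits[byteIndex];
        final int bitIndex = index & 7;   // index % 8
        return (b >> bitIndex) & 0x01;
    }

    public static void main(String[] args) {
        ValidityBitmapSketch validity = new ValidityBitmapSketch(16);
        validity.markSet(10); // byte 1, bit 2
        System.out.println(validity.isSet(10)); // 1
        System.out.println(validity.isSet(9));  // 0
    }
}
```

In this toy, a slot reads as set as soon as its bit is written, whether or not any length has been recorded.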

One potential solution is to initialize valueCount in the allocateNew
function, as I did here:
https://github.com/azimafroozeh/arrow/commit/4281613b7ed1370252a155192f12b9bca494dbeb
Both BaseVariableWidthVector and BaseFixedWidthVector have an allocateNew
function that would need to change. Is this an acceptable approach, or am I
missing some semantics?

Thanks,

Azim Afroozeh

Re: [Java] Question About Vector Allocation

Posted by Micah Kornfield <em...@gmail.com>.
valueCount includes both null and non-null values. Perhaps a better name
for the method would have been setSize or setLength.
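
To make the distinction concrete, here is a minimal, self-contained sketch. It is a toy model rather than Arrow's implementation: the class NullableSlotsSketch, its fields, and the capacity of 9 are assumptions chosen for illustration. It follows the same pattern as the test in question, leaving every third slot unset and then setting the value count to the full capacity.

```java
// Toy model, not Arrow's implementation: class and field names are hypothetical.
final class NullableSlotsSketch {
    private final String[] values;
    private final boolean[] valid; // false means the slot is null
    private int valueCount;        // logical length; null slots are included

    NullableSlotsSketch(int capacity) {
        values = new String[capacity];
        valid = new boolean[capacity];
    }

    void set(int index, String value) {
        values[index] = value;
        valid[index] = true;
    }

    void setValueCount(int count) {
        valueCount = count;
    }

    int getValueCount() {
        return valueCount;
    }

    // Number of null slots within the logical length.
    int getNullCount() {
        int nulls = 0;
        for (int i = 0; i < valueCount; i++) {
            if (!valid[i]) {
                nulls++;
            }
        }
        return nulls;
    }

    String getObject(int index) {
        return valid[index] ? values[index] : null;
    }

    public static void main(String[] args) {
        int capacity = 9;
        NullableSlotsSketch vector = new NullableSlotsSketch(capacity);
        for (int i = 0; i < capacity; i++) {
            if (i % 3 == 0) {
                continue; // leave slots 0, 3 and 6 unset
            }
            vector.set(i, Integer.toString(i));
        }
        vector.setValueCount(capacity);

        System.out.println(vector.getValueCount()); // 9
        System.out.println(vector.getNullCount());  // 3
        System.out.println(vector.getObject(3));    // null
        System.out.println(vector.getObject(4));    // 4
    }
}
```

In this model the value count is the length of the vector, and the validity flags alone decide which of those slots are null.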

On Thursday, November 14, 2019, azim afroozeh <af...@gmail.com> wrote:

> Thanks for your answer. I have one more question. Take this test function,
> for example:
> https://github.com/apache/arrow/blob/master/java/vector/src/test/java/org/apache/arrow/vector/TestValueVector.java#L1524
>
> It has a for loop that fills in some values but not all of them, leaving
> the rest null:
>
>       for (int i = 0; i < capacity; i++) {
>         if (i % 3 == 0) {
>           continue;
>         }
>         byte[] b = Integer.toString(i).getBytes();
>         vector.setSafe(i, b, 0, b.length);
>       }
>
> Then setValueCount is called to set the valueCount:
>
>       vector.setValueCount(capacity);
>
> I thought that setting valueCount to capacity means that all values are
> filled in and the vector contains no nulls. But the following loop then
> asserts that the unset values are null, which I would not expect, because
> valueCount is equal to capacity (all values are set):
>
>       for (int i = 0; i < capacity; i++) {
>         if (i % 3 == 0) {
>           assertNull(vector.getObject(i));
>         } else {
>           assertEquals("unexpected value at index: " + i,
>               Integer.toString(i), vector.getObject(i).toString());
>         }
>       }
>
> Am I missing something here?
>
> Thanks
>
> Azim

Re: [Java] Question About Vector Allocation

Posted by azim afroozeh <af...@gmail.com>.
Thanks for your answer. I have one more question. Take this test function,
for example:
https://github.com/apache/arrow/blob/master/java/vector/src/test/java/org/apache/arrow/vector/TestValueVector.java#L1524

It has a for loop that fills in some values but not all of them, leaving
the rest null:

      for (int i = 0; i < capacity; i++) {
        if (i % 3 == 0) {
          continue;
        }
        byte[] b = Integer.toString(i).getBytes();
        vector.setSafe(i, b, 0, b.length);
      }

Then setValueCount is called to set the valueCount:

      vector.setValueCount(capacity);

I thought that setting valueCount to capacity means that all values are
filled in and the vector contains no nulls. But the following loop then
asserts that the unset values are null, which I would not expect, because
valueCount is equal to capacity (all values are set):

      for (int i = 0; i < capacity; i++) {
        if (i % 3 == 0) {
          assertNull(vector.getObject(i));
        } else {
          assertEquals("unexpected value at index: " + i,
              Integer.toString(i), vector.getObject(i).toString());
        }
      }

Am I missing something here?

Thanks

Azim

On Thu, Nov 14, 2019 at 11:56 AM Fan Liya <li...@gmail.com> wrote:

> Hi Azim,
>
> According to the current API, after filling in some values, you have to set
> the value count manually (through the setValueCount method).
> Otherwise, the value count remains 0.
>
> Best,
> Liya Fan

Re: [Java] Question About Vector Allocation

Posted by Fan Liya <li...@gmail.com>.
Hi Azim,

According to the current API, after filling in some values, you have to set
the value count manually (through the setValueCount method).
Otherwise, the value count remains 0.

Best,
Liya Fan
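
The behavior described in this reply can be modeled with a small, self-contained sketch. It is a toy, not Arrow's code: the class ManualCountSketch and its fields are hypothetical, and the only point is that writing a slot and recording the length are separate steps.

```java
// Toy model, not Arrow's implementation: names are hypothetical.
final class ManualCountSketch {
    private final float[] data;
    private final boolean[] valid;
    private int valueCount; // stays 0 until the caller sets it

    ManualCountSketch(int capacity) {
        data = new float[capacity];
        valid = new boolean[capacity];
    }

    // Writes one slot; deliberately leaves valueCount alone.
    void set(int index, float value) {
        data[index] = value;
        valid[index] = true;
    }

    float get(int index) {
        return data[index];
    }

    void setValueCount(int count) {
        valueCount = count;
    }

    int getValueCount() {
        return valueCount;
    }

    public static void main(String[] args) {
        ManualCountSketch vector = new ManualCountSketch(16);
        vector.set(0, 100.5f);
        vector.set(2, 201.5f);
        vector.set(4, 300.3f);
        System.out.println(vector.getValueCount()); // 0

        vector.setValueCount(5); // slots 0..4, including the unset slots 1 and 3
        System.out.println(vector.getValueCount()); // 5
    }
}
```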


On Thu, Nov 14, 2019 at 6:33 PM azim afroozeh <af...@gmail.com> wrote:

> Thanks for your answer. So valueCount is the number of values filled in
> the vector.
>
> Then I would like to ask why valueCount is still 0 after setting some
> values. For example:
> https://github.com/apache/arrow/blob/3fbbcdaf77a9e354b6bd07ec1fd1dac005a505c9/java/vector/src/test/java/org/apache/arrow/vector/TestValueVector.java#L609
>
> System.out.print(vector.getValueCount()); // prints 0
> /* populate the vector */
> vector.set(0, 100.5f);
> vector.set(2, 201.5f);
> vector.set(4, 300.3f);
> vector.set(6, 423.8f);
> vector.set(8, 555.6f);
> vector.set(10, 66.6f);
> vector.set(12, 78.8f);
> vector.set(14, 89.5f);
> System.out.print(vector.getValueCount()); // prints 0
>
> If I add these two print lines, both print 0.
>
> Also, if I add the following code to isSet, some tests fail again:
>
>   if (valueCount == getValueCapacity()) {
>     return 1;
>   }
>
>
>
> Thanks,
>
>
> Azim Afroozeh

Re: [Java] Question About Vector Allocation

Posted by azim afroozeh <af...@gmail.com>.
Thanks for your answer. So valueCount is the number of values filled in
the vector.

Then I would like to ask why valueCount is still 0 after setting some
values. For example:
https://github.com/apache/arrow/blob/3fbbcdaf77a9e354b6bd07ec1fd1dac005a505c9/java/vector/src/test/java/org/apache/arrow/vector/TestValueVector.java#L609

System.out.print(vector.getValueCount()); // prints 0
/* populate the vector */
vector.set(0, 100.5f);
vector.set(2, 201.5f);
vector.set(4, 300.3f);
vector.set(6, 423.8f);
vector.set(8, 555.6f);
vector.set(10, 66.6f);
vector.set(12, 78.8f);
vector.set(14, 89.5f);
System.out.print(vector.getValueCount()); // prints 0

If I add these two print lines, both print 0.

Also, if I add the following code to isSet, some tests fail again:

  if (valueCount == getValueCapacity()) {
    return 1;
  }



Thanks,


Azim Afroozeh

On Fri, Nov 8, 2019 at 10:57 AM Fan Liya <li...@gmail.com> wrote:

> Hi Azim,
>
> I think we should be aware of two distinct concepts:
>
> 1. vector capacity: the max number of values that can be stored in the
> vector, without reallocation
> 2. vector length: the number of values actually filled in the vector
>
> For any valid vector, we always have vector length <= vector capacity.
>
> The allocateNew method expands the vector capacity, but it does not fill in
> any value, so it does not affect the vector length.
>
> For the code above, if the vector length is 0, the value of isSet(index)
> (where index > 0) should be undefined. So throwing an exception is the
> correct behavior.
>
> Hope this answers your question.
>
> Best,
> Liya Fan

Re: [Java] Question About Vector Allocation

Posted by Fan Liya <li...@gmail.com>.
Hi Azim,

I think we should be aware of two distinct concepts:

1. vector capacity: the max number of values that can be stored in the
vector, without reallocation
2. vector length: the number of values actually filled in the vector

For any valid vector, we always have vector length <= vector capacity.

The allocateNew method expands the vector capacity, but it does not fill in
any value, so it does not affect the vector length.

For the code above, if the vector length is 0, the value of isSet(index)
(where index > 0) should be undefined. So throwing an exception is the
correct behavior.

Hope this answers your question.

Best,
Liya Fan
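
The capacity/length distinction in this reply can be sketched as follows. This is a toy model under stated assumptions, not Arrow's implementation: the class CapacityVsLengthSketch is hypothetical, its capacity is fixed at construction, and its isSet rejects any index outside the current length, which is the behavior suggested above rather than something confirmed about the library.

```java
// Toy model, not Arrow's implementation: names and the bounds check are assumptions.
final class CapacityVsLengthSketch {
    private final boolean[] valid; // its size is the capacity
    private int valueCount;        // the length; always <= capacity

    CapacityVsLengthSketch(int capacity) {
        // Allocating storage sets the capacity but leaves the length at 0.
        valid = new boolean[capacity];
    }

    int getValueCapacity() {
        return valid.length;
    }

    int getValueCount() {
        return valueCount;
    }

    void set(int index) {
        valid[index] = true;
    }

    // This toy simply rejects a length larger than its capacity.
    void setValueCount(int count) {
        if (count < 0 || count > valid.length) {
            throw new IllegalArgumentException("length must be between 0 and " + valid.length);
        }
        valueCount = count;
    }

    // Reads outside the current length are treated as errors, not as "unset".
    int isSet(int index) {
        if (index < 0 || index >= valueCount) {
            throw new IndexOutOfBoundsException(
                "index " + index + " is outside the vector length " + valueCount);
        }
        return valid[index] ? 1 : 0;
    }

    public static void main(String[] args) {
        CapacityVsLengthSketch vector = new CapacityVsLengthSketch(8);
        System.out.println(vector.getValueCapacity()); // 8
        System.out.println(vector.getValueCount());    // 0

        vector.set(1);
        vector.setValueCount(2);
        System.out.println(vector.isSet(0)); // 0
        System.out.println(vector.isSet(1)); // 1
    }
}
```

With the length still at 0, any call to isSet in this sketch throws, which makes a forgotten setValueCount visible instead of silently reporting every slot as unset.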

