You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2021/01/13 09:02:45 UTC

[GitHub] [arrow] WeichenXu123 opened a new pull request #9187: [ARROW-11223] Fix: BaseVariableWidthVector setNull and getBufferSizeFor is buggy, may get error java.lang.IndexOutOfBoundsException: index: 15880, length: 4 (expected: range(0, 15880))

WeichenXu123 opened a new pull request #9187:
URL: https://github.com/apache/arrow/pull/9187


   Fix: BaseVariableWidthVector setNull and getBufferSizeFor is buggy, may get error java.lang.IndexOutOfBoundsException: index: 15880, length: 4 (expected: range(0, 15880))


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] kiszk edited a comment on pull request #9187: ARROW-11223: [Java] Fix: BaseVariableWidthVector/BaseLargeVariableWidthVector setNull() and getBufferSizeFor() trigger offset buffer overflow

Posted by GitBox <gi...@apache.org>.
kiszk edited a comment on pull request #9187:
URL: https://github.com/apache/arrow/pull/9187#issuecomment-762336602


   Thank you for your clarification. Looks good to me.
   cc @liyafan82 


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] WeichenXu123 edited a comment on pull request #9187: ARROW-11223: [Java] Fix: BaseVariableWidthVector/BaseLargeVariableWidthVector setNull() and getBufferSizeFor() trigger offset buffer overflow

Posted by GitBox <gi...@apache.org>.
WeichenXu123 edited a comment on pull request #9187:
URL: https://github.com/apache/arrow/pull/9187#issuecomment-763540715


   >  In a streaming scenario, in which you fill up batches up to a given size, you need to check the buffer size without actually closing the batch.
   
   Yes. and I think this is the scenario where we call `vector.getBufferSizeFor`
   
   i.e., before we call `setValueCount`, we won't use the vector, but we may call `vector.getBufferSizeFor` to monitor its buffer size. We need ensure `vector.getBufferSizeFor` works before we calling `vector.getBufferSizeFor`


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] WeichenXu123 commented on pull request #9187: ARROW-11223: [Java] Fix: BaseVariableWidthVector/BaseLargeVariableWidthVector setNull() and getBufferSizeFor() trigger offset buffer overflow

Posted by GitBox <gi...@apache.org>.
WeichenXu123 commented on pull request #9187:
URL: https://github.com/apache/arrow/pull/9187#issuecomment-766499542


   > A better way to get an estimate of buffer size would be to include a density value for the avg number of bytes per record, similar to setInitialCapacity(int valueCount, double density). You could then get the density from a previous vector and use that to estimate the size for the next vector
   
   This looks a. valuable proposal.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] HyukjinKwon commented on pull request #9187: ARROW-11223: [Java] Fix: BaseVariableWidthVector/BaseLargeVariableWidthVector setNull() and getBufferSizeFor() trigger offset buffer overflow

Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on pull request #9187:
URL: https://github.com/apache/arrow/pull/9187#issuecomment-763607402


   @liyafan82, does it imply that `setValueCount` should always be called before `vector.getBufferSizeFor` to have the guaranteed result without failure? It would be great if we can explicitly document this, or remove such requirement.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] bogdanghit commented on pull request #9187: ARROW-11223: [Java] Fix: BaseVariableWidthVector/BaseLargeVariableWidthVector setNull and getBufferSizeFor is buggy, may get error java.lang.IndexOutOfBoundsException: index: 15880, length: 4 (expected: range(0, 15880))

Posted by GitBox <gi...@apache.org>.
bogdanghit commented on pull request #9187:
URL: https://github.com/apache/arrow/pull/9187#issuecomment-762134794


   To be specific, the issue is caused by the different allocated size of a validity buffer which is N/8 from the offset buffer (N+1)*OFFSET_WIDTH. When the ration of null versus non-null values is large and the offset buffer gets filled up before the validity buffer, the `getBufferSizeFor` call will trigger the overflow error.  It is quite easy to reproduce.
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] WeichenXu123 commented on pull request #9187: ARROW-11223: [Java] Fix: BaseVariableWidthVector/BaseLargeVariableWidthVector setNull() and getBufferSizeFor() trigger offset buffer overflow

Posted by GitBox <gi...@apache.org>.
WeichenXu123 commented on pull request #9187:
URL: https://github.com/apache/arrow/pull/9187#issuecomment-763540715


   >  In a streaming scenario, in which you fill up batches up to a given size, you need to check the buffer size without actually closing the batch.
   
   Yes. and I think this is the scenario where we call `vector.getBufferSizeFor()`


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] liyafan82 commented on pull request #9187: ARROW-11223: [Java] Fix: BaseVariableWidthVector/BaseLargeVariableWidthVector setNull() and getBufferSizeFor() trigger offset buffer overflow

Posted by GitBox <gi...@apache.org>.
liyafan82 commented on pull request #9187:
URL: https://github.com/apache/arrow/pull/9187#issuecomment-763600746


   @bogdanghit @WeichenXu123 Thanks for your feedback.
   
   IMO, calling `setValueCount` does not close the vector. You can still append values and call `setValueCount` multiple times.
   In the example code, if you call `setValueCount` after the for loop, the problem should disappear. 
   
   Does it make sense to you?


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] liyafan82 commented on pull request #9187: ARROW-11223: [Java] Fix: BaseVariableWidthVector/BaseLargeVariableWidthVector setNull() and getBufferSizeFor() trigger offset buffer overflow

Posted by GitBox <gi...@apache.org>.
liyafan82 commented on pull request #9187:
URL: https://github.com/apache/arrow/pull/9187#issuecomment-764200223


   > @liyafan82 But, I don't agree. See this section comment in source code:
   > https://github.com/apache/arrow/blob/691286975f277f00586cabc6d834ff1efd8caf8c/java/vector/src/main/java/org/apache/arrow/vector/ValueVector.java#L163
   > 
   > ```
   >   /**
   >    * Returns the number of bytes that is used by this vector if it holds the given number
   >    * of values. The result will be the same as if setValueCount() were called, followed
   >    * by calling getBufferSize(), but without any of the closing side-effects that setValueCount()
   >    * implies wrt finishing off the population of a vector. Some operations might wish to use
   >    * this to determine how much memory has been used by a vector so far, even though it is
   >    * not finished being populated.
   >    *
   >    * @param valueCount the number of values to assume this vector contains
   >    * @return the buffer size if this vector is holding valueCount values
   >    */
   >   int getBufferSizeFor(int valueCount);
   > ```
   > 
   > The `getBufferSizeFor` is designed for the case: when `setValueCount` haven't been called.
   > It also definitely say `setValueCount()` will bring "closing side-effects" (implies wrt finishing off the population of a vector)
   
   @WeichenXu123 Please note that the JavaDoc is overriden when the `BaseVariableWidthVector` overrides the method in the super interface:
   
   https://github.com/apache/arrow/blob/master/java/vector/src/main/java/org/apache/arrow/vector/BaseVariableWidthVector.java#L598-L603
   
   The reason is that, for variable width vectors, it is not possible to estimate the buffer size without actually filling up the vector. (The varchars can be large or small, depending on the actual values).
   
   If the documentation is not clearly enough, can we revise it?
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] WeichenXu123 commented on pull request #9187: ARROW-11223: [Java] Fix: BaseVariableWidthVector/BaseLargeVariableWidthVector setNull() and getBufferSizeFor() trigger offset buffer overflow

Posted by GitBox <gi...@apache.org>.
WeichenXu123 commented on pull request #9187:
URL: https://github.com/apache/arrow/pull/9187#issuecomment-778619461


   > Is it a requirement in your use case to be static?
   
   No. Just a suggestion, has pros and cons. Not a requirement.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] WeichenXu123 edited a comment on pull request #9187: ARROW-11223: [Java] Fix: BaseVariableWidthVector/BaseLargeVariableWidthVector setNull() and getBufferSizeFor() trigger offset buffer overflow

Posted by GitBox <gi...@apache.org>.
WeichenXu123 edited a comment on pull request #9187:
URL: https://github.com/apache/arrow/pull/9187#issuecomment-763540715






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] WeichenXu123 commented on pull request #9187: ARROW-11223: [Java] Fix: BaseVariableWidthVector setNull and getBufferSizeFor is buggy, may get error java.lang.IndexOutOfBoundsException: index: 15880, length: 4 (expected: range(0, 15880))

Posted by GitBox <gi...@apache.org>.
WeichenXu123 commented on pull request #9187:
URL: https://github.com/apache/arrow/pull/9187#issuecomment-762123598


   @kiszk Test added.
   Could this PR be backported ?
   CC @HyukjinKwon @bogdanghit 


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] liyafan82 commented on pull request #9187: ARROW-11223: [Java] Fix: BaseVariableWidthVector/BaseLargeVariableWidthVector setNull() and getBufferSizeFor() trigger offset buffer overflow

Posted by GitBox <gi...@apache.org>.
liyafan82 commented on pull request #9187:
URL: https://github.com/apache/arrow/pull/9187#issuecomment-764197687


   > @liyafan82, does it imply that `setValueCount` should always be called before `vector.getBufferSizeFor` to have the guaranteed result without failure? It would be great if we can explicitly document this, or remove such requirement.
   
   Generally, `setValueCount` should be called before `vector.getBufferSizeFor` .  Otherwise, the vector may be in an undefined state, with unexpected behaviors. 
   It would be difficult to remove such a requirement, as it would be expensive to maintain the vector in a consisent state for all the time.
   It is a good idea to revise the documentation, as it seems to be causing the confusion.  


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] WeichenXu123 commented on pull request #9187: ARROW-11223: [Java] Fix: BaseVariableWidthVector/BaseLargeVariableWidthVector setNull() and getBufferSizeFor() trigger offset buffer overflow

Posted by GitBox <gi...@apache.org>.
WeichenXu123 commented on pull request #9187:
URL: https://github.com/apache/arrow/pull/9187#issuecomment-763621581


   @liyafan82 But, I don't agree. See this section comment in source code:
   https://github.com/apache/arrow/blob/691286975f277f00586cabc6d834ff1efd8caf8c/java/vector/src/main/java/org/apache/arrow/vector/ValueVector.java#L163
   
   ```
     /**
      * Returns the number of bytes that is used by this vector if it holds the given number
      * of values. The result will be the same as if setValueCount() were called, followed
      * by calling getBufferSize(), but without any of the closing side-effects that setValueCount()
      * implies wrt finishing off the population of a vector. Some operations might wish to use
      * this to determine how much memory has been used by a vector so far, even though it is
      * not finished being populated.
      *
      * @param valueCount the number of values to assume this vector contains
      * @return the buffer size if this vector is holding valueCount values
      */
     int getBufferSizeFor(int valueCount);
   ```
   
   The `getBufferSizeFor` is designed for the case: when `setValueCount` haven't been called.
   It also definitely say `setValueCount()` will bring "closing side-effects" (implies wrt finishing off the population of a vector)


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] BryanCutler commented on pull request #9187: ARROW-11223: [Java] Fix: BaseVariableWidthVector/BaseLargeVariableWidthVector setNull() and getBufferSizeFor() trigger offset buffer overflow

Posted by GitBox <gi...@apache.org>.
BryanCutler commented on pull request #9187:
URL: https://github.com/apache/arrow/pull/9187#issuecomment-783763424


   merged to master, thanks @WeichenXu123 


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] BryanCutler closed pull request #9187: ARROW-11223: [Java] Fix: BaseVariableWidthVector/BaseLargeVariableWidthVector setNull() and getBufferSizeFor() trigger offset buffer overflow

Posted by GitBox <gi...@apache.org>.
BryanCutler closed pull request #9187:
URL: https://github.com/apache/arrow/pull/9187


   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] WeichenXu123 edited a comment on pull request #9187: ARROW-11223: [Java] Fix: BaseVariableWidthVector/BaseLargeVariableWidthVector setNull() and getBufferSizeFor() trigger offset buffer overflow

Posted by GitBox <gi...@apache.org>.
WeichenXu123 edited a comment on pull request #9187:
URL: https://github.com/apache/arrow/pull/9187#issuecomment-765076360


   @liyafan82 
   
   > The reason is that, for variable width vectors, it is not possible to estimate the buffer size without actually filling up the vector. 
   
   Why not possible ? For variableWidthVector, The buffer format is quite simple:
   1) validity buffer: nbytes = ceil(N/8) 
   2) offset buffer: nbytes = OFFSET_WIDTH * (N + 1)
   3) value buffer: nbytes = sum of all non-null value bytes count
   
   For `variableWidthVector`, the `setValueCount` method only do additional things of:
   1) fill holes (fill the offset buffer gap for last consecutive NULL values), it won't add any data into "value buffer".
   2) reset offset reader/writer index (this looks like has some side effect)
   
   But, without calling `setValueCount`, it doesn't affect we compute the "vector buffer size"
   
   Does it make sense ?
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] WeichenXu123 commented on pull request #9187: ARROW-11223: [Java] Fix: BaseVariableWidthVector/BaseLargeVariableWidthVector setNull() and getBufferSizeFor() trigger offset buffer overflow

Posted by GitBox <gi...@apache.org>.
WeichenXu123 commented on pull request #9187:
URL: https://github.com/apache/arrow/pull/9187#issuecomment-763544300


   @liyafan82 And, the "by designed" code, your said, 
   
   ```
   while (index >= getValidityBufferValueCapacity()) {
     reallocValidityAndOffsetBuffers();
   }
   ```
   It doesn't looks like a "by design" code but looks like a "negligence". It doesn't save memory. Because you still need to scale up both of validity and offset buffer.
   
   If the original code is "by design" and want to save memory in this case, it should be written like:
   ```
   while (index >= getValidityBufferValueCapacity()) {
     reallocValidityBuffer();
   }
   ```
   
   
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] WeichenXu123 edited a comment on pull request #9187: ARROW-11223: [Java] Fix: BaseVariableWidthVector/BaseLargeVariableWidthVector setNull() and getBufferSizeFor() trigger offset buffer overflow

Posted by GitBox <gi...@apache.org>.
WeichenXu123 edited a comment on pull request #9187:
URL: https://github.com/apache/arrow/pull/9187#issuecomment-763540715


   >  In a streaming scenario, in which you fill up batches up to a given size, you need to check the buffer size without actually closing the batch.
   
   Yes. and I think this is the scenario where we call `vector.getBufferSizeFor`


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] liyafan82 commented on pull request #9187: ARROW-11223: [Java] Fix: BaseVariableWidthVector/BaseLargeVariableWidthVector setNull() and getBufferSizeFor() trigger offset buffer overflow

Posted by GitBox <gi...@apache.org>.
liyafan82 commented on pull request #9187:
URL: https://github.com/apache/arrow/pull/9187#issuecomment-763362710


   Thanks for reporting the problem and thanks for the discussion. 
   IMO, the behavior is by design, and the problem can be solved by calling the `setValueCount` method after calling `setNull`.
   Before calling `setValueCount`, the vector is in an undefined state, and can have some unexpected behaviors. 
   
   To use the vectors properly, we need to respect its life-cycle. Details can be found in https://arrow.apache.org/docs/java/vector.html#vector-life-cycle


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] WeichenXu123 edited a comment on pull request #9187: ARROW-11223: [Java] Fix: BaseVariableWidthVector/BaseLargeVariableWidthVector setNull() and getBufferSizeFor() trigger offset buffer overflow

Posted by GitBox <gi...@apache.org>.
WeichenXu123 edited a comment on pull request #9187:
URL: https://github.com/apache/arrow/pull/9187#issuecomment-765076360


   @liyafan82 
   
   > The reason is that, for variable width vectors, it is not possible to estimate the buffer size without actually filling up the vector. 
   
   Why not possible ? For variableWidthVector, The buffer layout is quite simple:
   1) validity buffer: nbytes = ceil(N/8) 
   2) offset buffer: nbytes = OFFSET_WIDTH * (N + 1)
   3) value buffer: nbytes = sum of all non-null value bytes count
   
   For `variableWidthVector`, the `setValueCount` method only do additional things of:
   1) fill holes (fill the offset buffer gap for last consecutive NULL values), it won't add any data into "value buffer".
   2) reset these buffer reader/writer index (this looks like has some side effect)
   
   But, without calling `setValueCount`, it doesn't affect we compute the "vector buffer size"
   
   Does it make sense ?
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] WeichenXu123 edited a comment on pull request #9187: ARROW-11223: [Java] Fix: BaseVariableWidthVector/BaseLargeVariableWidthVector setNull() and getBufferSizeFor() trigger offset buffer overflow

Posted by GitBox <gi...@apache.org>.
WeichenXu123 edited a comment on pull request #9187:
URL: https://github.com/apache/arrow/pull/9187#issuecomment-763544300


   @liyafan82 And, the "by design" code, your said, 
   
   ```
   while (index >= getValidityBufferValueCapacity()) {
     reallocValidityAndOffsetBuffers();
   }
   ```
   It doesn't looks like a "by design" code but looks like a "negligence". It doesn't save memory. Because you still need to scale up both of validity and offset buffer.
   
   If the original code is "by design" and want to save memory in this case, it should be written like:
   ```
   while (index >= getValidityBufferValueCapacity()) {
     reallocValidityBuffer();
   }
   ```
   
   
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] BryanCutler commented on pull request #9187: ARROW-11223: [Java] Fix: BaseVariableWidthVector/BaseLargeVariableWidthVector setNull() and getBufferSizeFor() trigger offset buffer overflow

Posted by GitBox <gi...@apache.org>.
BryanCutler commented on pull request #9187:
URL: https://github.com/apache/arrow/pull/9187#issuecomment-763133851


   Thanks for the PR, I'll take a look soon


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] BryanCutler commented on pull request #9187: ARROW-11223: [Java] Fix: BaseVariableWidthVector/BaseLargeVariableWidthVector setNull() and getBufferSizeFor() trigger offset buffer overflow

Posted by GitBox <gi...@apache.org>.
BryanCutler commented on pull request #9187:
URL: https://github.com/apache/arrow/pull/9187#issuecomment-778506839


   @WeichenXu123 we would just need to add the following API to `BaseVariableWidthVector` (and possibly `BaseRepeatedValueVector`)
   
   ```Java
   public int getBufferSizeFor(final int valueCount, double density)
   ```
   The argument `density` is used elsewhere, so we should stick with that to be consistent. Also, the method `public double getDensity()` already exists in `BaseVariableWidthVector`.
   
   As for it being static, it could be, but just as the related methods it might be a good idea to not do this in case a subclass would want to override it in the future. Is it a requirement in your use case to be static?


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] kiszk commented on pull request #9187: ARROW-11223: [Java] Fix: BaseVariableWidthVector setNull and getBufferSizeFor is buggy, may get error java.lang.IndexOutOfBoundsException: index: 15880, length: 4 (expected: range(0, 15880))

Posted by GitBox <gi...@apache.org>.
kiszk commented on pull request #9187:
URL: https://github.com/apache/arrow/pull/9187#issuecomment-760736536


   Is it possible to add a test case?


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] emkornfield commented on pull request #9187: ARROW-11223: [Java] Fix: BaseVariableWidthVector/BaseLargeVariableWidthVector setNull() and getBufferSizeFor() trigger offset buffer overflow

Posted by GitBox <gi...@apache.org>.
emkornfield commented on pull request #9187:
URL: https://github.com/apache/arrow/pull/9187#issuecomment-770454112


   @BryanCutler @liyafan82 is there consensus on how to move forward with this PR?


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] BryanCutler commented on pull request #9187: ARROW-11223: [Java] Fix: BaseVariableWidthVector/BaseLargeVariableWidthVector setNull() and getBufferSizeFor() trigger offset buffer overflow

Posted by GitBox <gi...@apache.org>.
BryanCutler commented on pull request #9187:
URL: https://github.com/apache/arrow/pull/9187#issuecomment-783760535


   Ok I made https://issues.apache.org/jira/browse/ARROW-11739 to add the API discussed here. As for the changes here, I think they are fine because the offset vector is being reallocated along with the validity buffer, so it only makes sense to ensure they can both hold the same value count. I'll go ahead and merge.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] WeichenXu123 commented on pull request #9187: ARROW-11223: [Java] Fix: BaseVariableWidthVector/BaseLargeVariableWidthVector setNull() and getBufferSizeFor() trigger offset buffer overflow

Posted by GitBox <gi...@apache.org>.
WeichenXu123 commented on pull request #9187:
URL: https://github.com/apache/arrow/pull/9187#issuecomment-765076360


   @liyafan82 
   
   > The reason is that, for variable width vectors, it is not possible to estimate the buffer size without actually filling up the vector. 
   
   Why not possible ? For variableWidthVector, The buffer format is quite simple:
   1) validity buffer: nbytes = ceil(N/8) 
   2) offset buffer: nbytes = OFFSET_WIDTH * (N + 1)
   3) value buffer: nbytes = sum of all non-null value bytes count
   
   For `variableWidthVector`, the `setValueCount` method only do additional things of:
   1) fill holes (fill the offset buffer gap for last consecutive NULL values)
   2) reset offset reader/writer index (this looks like has some side effect)
   But, without calling `setValueCount`, it doesn't affect we compute the "vector buffer size"
   
   Does it make sense ?
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] kiszk commented on pull request #9187: ARROW-11223: [Java] Fix: BaseVariableWidthVector/BaseLargeVariableWidthVector setNull() and getBufferSizeFor() trigger offset buffer overflow

Posted by GitBox <gi...@apache.org>.
kiszk commented on pull request #9187:
URL: https://github.com/apache/arrow/pull/9187#issuecomment-762336602


   Thank you for your clarification. Looks good to me.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] HyukjinKwon edited a comment on pull request #9187: ARROW-11223: [Java] Fix: BaseVariableWidthVector/BaseLargeVariableWidthVector setNull and getBufferSizeFor is buggy, may get error java.lang.IndexOutOfBoundsException: index: 15880, length: 4 (expected: range(0, 15880))

Posted by GitBox <gi...@apache.org>.
HyukjinKwon edited a comment on pull request #9187:
URL: https://github.com/apache/arrow/pull/9187#issuecomment-762127094


   I am not a committer in this project so I can't tell. What I can say is that: this seems a bug fix. This fix came from our internal fork to contribute back, and this is being tested under our internal fork.
   
   cc @BryanCutler from the git blame, and who speaks Spark too.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] liyafan82 commented on pull request #9187: ARROW-11223: [Java] Fix: BaseVariableWidthVector/BaseLargeVariableWidthVector setNull() and getBufferSizeFor() trigger offset buffer overflow

Posted by GitBox <gi...@apache.org>.
liyafan82 commented on pull request #9187:
URL: https://github.com/apache/arrow/pull/9187#issuecomment-764197687






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] BryanCutler commented on pull request #9187: ARROW-11223: [Java] Fix: BaseVariableWidthVector/BaseLargeVariableWidthVector setNull() and getBufferSizeFor() trigger offset buffer overflow

Posted by GitBox <gi...@apache.org>.
BryanCutler commented on pull request #9187:
URL: https://github.com/apache/arrow/pull/9187#issuecomment-766152565


   I think the intention of `getBufferSizeFor(final int valueCount)` is to provide an estimated buffer size of the vector and it doesn't make sense that the vector should have to be in a certain kind of state to get that estimate. And even calling `setValueCount()` doesn't provide a good estimate since that will just fill empty data. Since this is a variable width vector, it also doesn't make sense to try to get that estimate from a `valueCount` alone.
   
   A better way to get an estimate of buffer size would be to include a `density` value for the avg number of bytes per record, similar to `setInitialCapacity(int valueCount, double density)`. You could then get the density from a previous vector and use that to estimate the size for the next vector:
   
   ```java
   int batch_size = 123;
   double prev_density = prev_vector.getDensity();
   int estimated_size = new_vector.getBufferSizeFor(batch_size, prev_density);
   ```
   
   This new `getBufferSizeFor()` does not need to be in any kind of state, and `setValueCount()` would not need to be called before hand. What are your guys thoughts on this, and does that work for your use case @WeichenXu123 ?


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] WeichenXu123 edited a comment on pull request #9187: ARROW-11223: [Java] Fix: BaseVariableWidthVector/BaseLargeVariableWidthVector setNull() and getBufferSizeFor() trigger offset buffer overflow

Posted by GitBox <gi...@apache.org>.
WeichenXu123 edited a comment on pull request #9187:
URL: https://github.com/apache/arrow/pull/9187#issuecomment-765076360


   @liyafan82 
   
   > The reason is that, for variable width vectors, it is not possible to estimate the buffer size without actually filling up the vector. 
   
   Why not possible ? For variableWidthVector, The buffer format is quite simple:
   1) validity buffer: nbytes = ceil(N/8) 
   2) offset buffer: nbytes = OFFSET_WIDTH * (N + 1)
   3) value buffer: nbytes = sum of all non-null value bytes count
   
   For `variableWidthVector`, the `setValueCount` method only do additional things of:
   1) fill holes (fill the offset buffer gap for last consecutive NULL values)
   2) reset offset reader/writer index (this looks like has some side effect)
   
   But, without calling `setValueCount`, it doesn't affect we compute the "vector buffer size"
   
   Does it make sense ?
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] WeichenXu123 commented on pull request #9187: ARROW-11223: [Java] Fix: BaseVariableWidthVector/BaseLargeVariableWidthVector setNull() and getBufferSizeFor() trigger offset buffer overflow

Posted by GitBox <gi...@apache.org>.
WeichenXu123 commented on pull request #9187:
URL: https://github.com/apache/arrow/pull/9187#issuecomment-765076360


   @liyafan82 
   
   > The reason is that, for variable width vectors, it is not possible to estimate the buffer size without actually filling up the vector. 
   
   Why not possible ? For variableWidthVector, The buffer format is quite simple:
   1) validity buffer: nbytes = ceil(N/8) 
   2) offset buffer: nbytes = OFFSET_WIDTH * (N + 1)
   3) value buffer: nbytes = sum of all non-null value bytes count
   
   For `variableWidthVector`, the `setValueCount` method only do additional things of:
   1) fill holes (fill the offset buffer gap for last consecutive NULL values)
   2) reset offset reader/writer index (this looks like has some side effect)
   But, without calling `setValueCount`, it doesn't affect we compute the "vector buffer size"
   
   Does it make sense ?
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] WeichenXu123 edited a comment on pull request #9187: ARROW-11223: [Java] Fix: BaseVariableWidthVector/BaseLargeVariableWidthVector setNull() and getBufferSizeFor() trigger offset buffer overflow

Posted by GitBox <gi...@apache.org>.
WeichenXu123 edited a comment on pull request #9187:
URL: https://github.com/apache/arrow/pull/9187#issuecomment-763540715


   >  In a streaming scenario, in which you fill up batches up to a given size, you need to check the buffer size without actually closing the batch.
   
   Yes. and I think this is the scenario where we call `vector.getBufferSizeFor`
   
   i.e., before we call `setValueCount`, we won't use the vector, but we may call `vector.getBufferSizeFor` to monitor its buffer size. We need ensure `vector.getBufferSizeFor` works before we calling `vector.setValueCount`


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] github-actions[bot] commented on pull request #9187: ARROW-11223: JAVA Fix: BaseVariableWidthVector setNull and getBufferSizeFor is buggy, may get error java.lang.IndexOutOfBoundsException: index: 15880, length: 4 (expected: range(0, 15880))

Posted by GitBox <gi...@apache.org>.
github-actions[bot] commented on pull request #9187:
URL: https://github.com/apache/arrow/pull/9187#issuecomment-759312899


   https://issues.apache.org/jira/browse/ARROW-11223


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] HyukjinKwon edited a comment on pull request #9187: ARROW-11223: [Java] Fix: BaseVariableWidthVector/BaseLargeVariableWidthVector setNull and getBufferSizeFor is buggy, may get error java.lang.IndexOutOfBoundsException: index: 15880, length: 4 (expected: range(0, 15880))

Posted by GitBox <gi...@apache.org>.
HyukjinKwon edited a comment on pull request #9187:
URL: https://github.com/apache/arrow/pull/9187#issuecomment-762127094


   I am not a committer in this project so I can't tell. What I can say is that: this seems a bug fix. This fix came from our internal fork to contribute back, and this change has been tested under our internal fork.
   
   cc @BryanCutler from the git blame, and who speaks Spark too.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] liyafan82 commented on pull request #9187: ARROW-11223: [Java] Fix: BaseVariableWidthVector/BaseLargeVariableWidthVector setNull() and getBufferSizeFor() trigger offset buffer overflow

Posted by GitBox <gi...@apache.org>.
liyafan82 commented on pull request #9187:
URL: https://github.com/apache/arrow/pull/9187#issuecomment-765383673


   > @liyafan82
   > 
   > > The reason is that, for variable width vectors, it is not possible to estimate the buffer size without actually filling up the vector.
   > 
   > Why not possible ? For variableWidthVector, The buffer layout is quite simple:
   > 
   > 1. validity buffer: nbytes = ceil(N/8)
   > 2. offset buffer: nbytes = OFFSET_WIDTH * (N + 1)
   > 3. value buffer: nbytes = sum of all non-null value bytes count
   > 
   > For `variableWidthVector`, the `setValueCount` method only do additional things of:
   > 
   > 1. fill holes (fill the offset buffer gap for last consecutive NULL values), it won't add any data into "value buffer".
   > 2. reset these buffer reader/writer index (this looks like has some side effect)
   > 
   > But, without calling `setValueCount`, it doesn't affect we compute the "vector buffer size"
   > 
   > Does it make sense ?
   
   The purpose of calling `setValueCount` is to bring the vector into a consistent state, which makes it possible for subsequent method calls to be successful. 


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] WeichenXu123 commented on pull request #9187: ARROW-11223: [Java] Fix: BaseVariableWidthVector/BaseLargeVariableWidthVector setNull() and getBufferSizeFor() trigger offset buffer overflow

Posted by GitBox <gi...@apache.org>.
WeichenXu123 commented on pull request #9187:
URL: https://github.com/apache/arrow/pull/9187#issuecomment-766499542


   > A better way to get an estimate of buffer size would be to include a density value for the avg number of bytes per record, similar to setInitialCapacity(int valueCount, double density). You could then get the density from a previous vector and use that to estimate the size for the next vector
   
   This looks a. valuable proposal.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] WeichenXu123 commented on pull request #9187: ARROW-11223: [Java] Fix: BaseVariableWidthVector/BaseLargeVariableWidthVector setNull() and getBufferSizeFor() trigger offset buffer overflow

Posted by GitBox <gi...@apache.org>.
WeichenXu123 commented on pull request #9187:
URL: https://github.com/apache/arrow/pull/9187#issuecomment-770528506


   I think we can simply the interface @BryanCutler proposed. We can make it to be static method, for example:
   
   For String/Binary type:
   
   `VarCharVector.getBufferSize(valueCount, avgValueBytesLen)`
   
   and provide a method to get average value bytes length:
   
   `varCharVecInstance.getAvgValueBytesLen()`
   
   For fixed width type, such as Int,
   
   `IntVector.getBufferSize(valueCount)`
   
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] HyukjinKwon commented on pull request #9187: ARROW-11223: [Java] Fix: BaseVariableWidthVector/BaseLargeVariableWidthVector setNull and getBufferSizeFor is buggy, may get error java.lang.IndexOutOfBoundsException: index: 15880, length: 4 (expected: range(0, 15880))

Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on pull request #9187:
URL: https://github.com/apache/arrow/pull/9187#issuecomment-762127094


   I am not a committer in this project so I can't tell. What I can say is that: this seems a bug fix. This fix came from our internal fork, and this is being tested under our internal fork.
   
   cc @BryanCutler from the GitBlame, and who speaks Spark too.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] liyafan82 commented on pull request #9187: ARROW-11223: [Java] Fix: BaseVariableWidthVector/BaseLargeVariableWidthVector setNull() and getBufferSizeFor() trigger offset buffer overflow

Posted by GitBox <gi...@apache.org>.
liyafan82 commented on pull request #9187:
URL: https://github.com/apache/arrow/pull/9187#issuecomment-770515468


   > @BryanCutler @liyafan82 is there consensus on how to move forward with this PR?
   
   @BryanCutler 's proposal sounds good to me. 


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] bogdanghit commented on pull request #9187: ARROW-11223: [Java] Fix: BaseVariableWidthVector/BaseLargeVariableWidthVector setNull() and getBufferSizeFor() trigger offset buffer overflow

Posted by GitBox <gi...@apache.org>.
bogdanghit commented on pull request #9187:
URL: https://github.com/apache/arrow/pull/9187#issuecomment-763451111


   > Thanks for reporting the problem and thanks for the discussion.
   > IMO, the behavior is by design, and the problem can be solved by calling the `setValueCount` method after calling `setNull`.
   > Before calling `setValueCount`, the vector is in an undefined state, and can have some unexpected behaviors.
   > 
   > To use the vectors properly, we need to respect its life-cycle. Details can be found in https://arrow.apache.org/docs/java/vector.html#vector-life-cycle
   
   @liyafan82 wouldn't `setValueCount` close the batch? In a streaming scenario, in which you fill up batches up to a given size, you need to check the buffer size without actually closing the batch.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] WeichenXu123 edited a comment on pull request #9187: ARROW-11223: [Java] Fix: BaseVariableWidthVector/BaseLargeVariableWidthVector setNull() and getBufferSizeFor() trigger offset buffer overflow

Posted by GitBox <gi...@apache.org>.
WeichenXu123 edited a comment on pull request #9187:
URL: https://github.com/apache/arrow/pull/9187#issuecomment-765076360


   @liyafan82 
   
   > The reason is that, for variable width vectors, it is not possible to estimate the buffer size without actually filling up the vector. 
   
   Why not possible ? For variableWidthVector, The buffer layout is quite simple:
   1) validity buffer: nbytes = ceil(N/8) 
   2) offset buffer: nbytes = OFFSET_WIDTH * (N + 1)
   3) value buffer: nbytes = sum of all non-null value bytes count
   
   For `variableWidthVector`, the `setValueCount` method only do additional things of:
   1) fill holes (fill the offset buffer gap for last consecutive NULL values), it won't add any data into "value buffer".
   2) reset offset reader/writer index (this looks like has some side effect)
   
   But, without calling `setValueCount`, it doesn't affect we compute the "vector buffer size"
   
   Does it make sense ?
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] github-actions[bot] commented on pull request #9187: [ARROW-11223] Fix: BaseVariableWidthVector setNull and getBufferSizeFor is buggy, may get error java.lang.IndexOutOfBoundsException: index: 15880, length: 4 (expected: range(0, 15880))

Posted by GitBox <gi...@apache.org>.
github-actions[bot] commented on pull request #9187:
URL: https://github.com/apache/arrow/pull/9187#issuecomment-759308200


   <!--
     Licensed to the Apache Software Foundation (ASF) under one
     or more contributor license agreements.  See the NOTICE file
     distributed with this work for additional information
     regarding copyright ownership.  The ASF licenses this file
     to you under the Apache License, Version 2.0 (the
     "License"); you may not use this file except in compliance
     with the License.  You may obtain a copy of the License at
   
       http://www.apache.org/licenses/LICENSE-2.0
   
     Unless required by applicable law or agreed to in writing,
     software distributed under the License is distributed on an
     "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
     KIND, either express or implied.  See the License for the
     specific language governing permissions and limitations
     under the License.
   -->
   
   Thanks for opening a pull request!
   
   Could you open an issue for this pull request on JIRA?
   https://issues.apache.org/jira/browse/ARROW
   
   Then could you also rename pull request title in the following format?
   
       ARROW-${JIRA_ID}: [${COMPONENT}] ${SUMMARY}
   
   See also:
   
     * [Other pull requests](https://github.com/apache/arrow/pulls/)
     * [Contribution Guidelines - How to contribute patches](https://arrow.apache.org/docs/developers/contributing.html#how-to-contribute-patches)
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org