You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2020/05/18 10:12:27 UTC

[GitHub] [arrow] pprudhvi opened a new pull request #7214: ARROW-8842: [Java] fix ListVector's setValueCount to set inner vector's value count correctly

pprudhvi opened a new pull request #7214:
URL: https://github.com/apache/arrow/pull/7214


   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] awildturtok commented on pull request #7214: ARROW-8842: [Java] fix ListVector's setValueCount to set inner vector's value count correctly

Posted by GitBox <gi...@apache.org>.
awildturtok commented on pull request #7214:
URL: https://github.com/apache/arrow/pull/7214#issuecomment-819479092


   Not trying to be obtuse, but how would I do that? Does it even make sense to await an answer from @jacques-n  if they've not responded in about a year? @pprudhvi did good work, maybe someone else can review his work


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] pprudhvi closed pull request #7214: ARROW-8842: [Java] fix ListVector's setValueCount to set inner vector's value count correctly

Posted by GitBox <gi...@apache.org>.
pprudhvi closed pull request #7214:
URL: https://github.com/apache/arrow/pull/7214


   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] jacques-n commented on a change in pull request #7214: ARROW-8842: [Java] fix ListVector's setValueCount to set inner vector's value count correctly

Posted by GitBox <gi...@apache.org>.
jacques-n commented on a change in pull request #7214:
URL: https://github.com/apache/arrow/pull/7214#discussion_r426748450



##########
File path: java/vector/src/main/java/org/apache/arrow/vector/complex/ListVector.java
##########
@@ -844,7 +844,7 @@ public void setValueCount(int valueCount) {
     }
     /* valueCount for the data vector is the current end offset */
     final int childValueCount = (valueCount == 0) ? 0 :
-            offsetBuffer.getInt((lastSet + 1) * OFFSET_WIDTH);
+            offsetBuffer.getInt(valueCount * OFFSET_WIDTH);

Review comment:
       shouldn't this be (valueCount + 1) * OFFSET_WIDTH?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] awildturtok commented on pull request #7214: ARROW-8842: [Java] fix ListVector's setValueCount to set inner vector's value count correctly

Posted by GitBox <gi...@apache.org>.
awildturtok commented on pull request #7214:
URL: https://github.com/apache/arrow/pull/7214#issuecomment-816776301


   Why has this PR been closed? We are still experiencing this issue on v3.0.0


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] awildturtok edited a comment on pull request #7214: ARROW-8842: [Java] fix ListVector's setValueCount to set inner vector's value count correctly

Posted by GitBox <gi...@apache.org>.
awildturtok edited a comment on pull request #7214:
URL: https://github.com/apache/arrow/pull/7214#issuecomment-819479092


   Not trying to be obtuse, but how would I do that? Does it even make sense to await an answer from @jacques-n  if they've not responded in about a year? @pprudhvi did good work, maybe someone else can review their work


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] awildturtok commented on pull request #7214: ARROW-8842: [Java] fix ListVector's setValueCount to set inner vector's value count correctly

Posted by GitBox <gi...@apache.org>.
awildturtok commented on pull request #7214:
URL: https://github.com/apache/arrow/pull/7214#issuecomment-852158288


   Ahoi,
   
   we are experiencing issues that we feel is related to this PR/a workaround we are trying to do basically the following:
   
   ```{java}
   private static RowConsumer listVectorFiller(ListVector vector, int RowNumber, List<String> values){
       // Values is a vertical list
   
       int start = vector.startNewValue(rowNumber);
       final FieldVector innerVector = vector.getDataVector();
   
       for (int i = 0; i < values.size(); i++) {
           String value = values.get(i);
           innerVector.setSafe(start + i, new Text(value));        
       }
   
       // Workaround for https://issues.apache.org/jira/browse/ARROW-8842
       int valueCount = innerVector.getValueCount();
       innerVector.setValueCount(valueCount + values.size()); // ie grow the innerVector by the inner values
   
       vector.endValue(rowNumber,values.size());
   }
   ```
   
   We are generating an Arrow file that rendered as CSV has  34MB, but as arrs/arrf comes in a 1GB also taking quite long to generate. Since it also contains a crude fix for this PR we wanted to make sure if this is the proper way of creating a ListVector.
   
   I've attached a flamegraph of the generation: [arrow-download.svg.zip](https://github.com/apache/arrow/files/6576795/arrow-download.svg.zip)
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] github-actions[bot] commented on pull request #7214: ARROW-8842: [Java] fix ListVector's setValueCount to set inner vector's value count correctly

Posted by GitBox <gi...@apache.org>.
github-actions[bot] commented on pull request #7214:
URL: https://github.com/apache/arrow/pull/7214#issuecomment-630087774


   https://issues.apache.org/jira/browse/ARROW-8842


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] awildturtok edited a comment on pull request #7214: ARROW-8842: [Java] fix ListVector's setValueCount to set inner vector's value count correctly

Posted by GitBox <gi...@apache.org>.
awildturtok edited a comment on pull request #7214:
URL: https://github.com/apache/arrow/pull/7214#issuecomment-852158288


   Ahoi,
   
   we are experiencing issues that we feel is related to this PR/a workaround we are trying to do basically the following:
   
   ```{java}
   private static RowConsumer listVectorFiller(ListVector vector, int RowNumber, List<String> values){
       // Values is a vertical list
   
       int start = vector.startNewValue(rowNumber);
       final FieldVector innerVector = vector.getDataVector();
   
       for (int i = 0; i < values.size(); i++) {
           String value = values.get(i);
           innerVector.setSafe(start + i, new Text(value));        
       }
   
       // Workaround for https://issues.apache.org/jira/browse/ARROW-8842
       int valueCount = innerVector.getValueCount();
       innerVector.setValueCount(valueCount + values.size()); // ie grow the innerVector by the inner values
   
       vector.endValue(rowNumber,values.size());
   }
   ```
   
   We are generating an Arrow file that rendered as CSV has  34MB, but as arrs/arrf comes in a 1GB also taking quite long to generate. Since it also contains a crude fix for this PR we wanted to make sure if this is the proper way of creating a ListVector.
   
   I've attached a flamegraph of the generation: [arrow-download.svg.zip](https://github.com/apache/arrow/files/6576795/arrow-download.svg.zip)
   
   After reading then writing the file again using python, the file is only 11MB.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] pprudhvi commented on a change in pull request #7214: ARROW-8842: [Java] fix ListVector's setValueCount to set inner vector's value count correctly

Posted by GitBox <gi...@apache.org>.
pprudhvi commented on a change in pull request #7214:
URL: https://github.com/apache/arrow/pull/7214#discussion_r426749944



##########
File path: java/vector/src/main/java/org/apache/arrow/vector/complex/ListVector.java
##########
@@ -844,7 +844,7 @@ public void setValueCount(int valueCount) {
     }
     /* valueCount for the data vector is the current end offset */
     final int childValueCount = (valueCount == 0) ? 0 :
-            offsetBuffer.getInt((lastSet + 1) * OFFSET_WIDTH);
+            offsetBuffer.getInt(valueCount * OFFSET_WIDTH);

Review comment:
       valueCount starts from 1




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] jacques-n commented on a change in pull request #7214: ARROW-8842: [Java] fix ListVector's setValueCount to set inner vector's value count correctly

Posted by GitBox <gi...@apache.org>.
jacques-n commented on a change in pull request #7214:
URL: https://github.com/apache/arrow/pull/7214#discussion_r613386671



##########
File path: java/vector/src/main/java/org/apache/arrow/vector/complex/ListVector.java
##########
@@ -844,7 +844,7 @@ public void setValueCount(int valueCount) {
     }
     /* valueCount for the data vector is the current end offset */
     final int childValueCount = (valueCount == 0) ? 0 :
-            offsetBuffer.getInt((lastSet + 1) * OFFSET_WIDTH);
+            offsetBuffer.getInt(valueCount * OFFSET_WIDTH);

Review comment:
       I think the bug was actually that last set is not updated just above (fill the holes with offsets populates so more items are last set but doesn't actually update the last set value). If the vector was internally consistent, lastSet + 1 should be equal to valueCount. LastSet is zero-index and valuecount is 1-index so I believe the fix is correct. I think, probably that lastSet should also be updated at the end of the "fill the holes with offsets" section as well.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] pprudhvi commented on a change in pull request #7214: ARROW-8842: [Java] fix ListVector's setValueCount to set inner vector's value count correctly

Posted by GitBox <gi...@apache.org>.
pprudhvi commented on a change in pull request #7214:
URL: https://github.com/apache/arrow/pull/7214#discussion_r426753192



##########
File path: java/vector/src/main/java/org/apache/arrow/vector/complex/ListVector.java
##########
@@ -844,7 +844,7 @@ public void setValueCount(int valueCount) {
     }
     /* valueCount for the data vector is the current end offset */
     final int childValueCount = (valueCount == 0) ? 0 :
-            offsetBuffer.getInt((lastSet + 1) * OFFSET_WIDTH);
+            offsetBuffer.getInt(valueCount * OFFSET_WIDTH);

Review comment:
       yes, in that case valueCount will be 1 which gives the # elements in the first value = # values in data vector. The problem is when one of the elements in the first value is null, will data vector store null? should we consider validity buffer of data vector? this was probably why praveen made that change 




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] thoniTUB commented on pull request #7214: ARROW-8842: [Java] fix ListVector's setValueCount to set inner vector's value count correctly

Posted by GitBox <gi...@apache.org>.
thoniTUB commented on pull request #7214:
URL: https://github.com/apache/arrow/pull/7214#issuecomment-877202519


   Hey, I found the problem in our implementation that @awildturtok described, which caused the large files: The call to `VectorSchemaRoot::clear` was missing after writing a batch. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] emkornfield commented on a change in pull request #7214: ARROW-8842: [Java] fix ListVector's setValueCount to set inner vector's value count correctly

Posted by GitBox <gi...@apache.org>.
emkornfield commented on a change in pull request #7214:
URL: https://github.com/apache/arrow/pull/7214#discussion_r493188671



##########
File path: java/vector/src/main/java/org/apache/arrow/vector/complex/ListVector.java
##########
@@ -844,7 +844,7 @@ public void setValueCount(int valueCount) {
     }
     /* valueCount for the data vector is the current end offset */
     final int childValueCount = (valueCount == 0) ? 0 :
-            offsetBuffer.getInt((lastSet + 1) * OFFSET_WIDTH);
+            offsetBuffer.getInt(valueCount * OFFSET_WIDTH);

Review comment:
       @jacques-n does this explanation make sense to you, what are the next steps here?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] emkornfield commented on pull request #7214: ARROW-8842: [Java] fix ListVector's setValueCount to set inner vector's value count correctly

Posted by GitBox <gi...@apache.org>.
emkornfield commented on pull request #7214:
URL: https://github.com/apache/arrow/pull/7214#issuecomment-819616996


   I reviewed.  I'm going to reopen and merge.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] emkornfield commented on pull request #7214: ARROW-8842: [Java] fix ListVector's setValueCount to set inner vector's value count correctly

Posted by GitBox <gi...@apache.org>.
emkornfield commented on pull request #7214:
URL: https://github.com/apache/arrow/pull/7214#issuecomment-817454551


   I don't know why @pprudhvi closed it.  There seemed to be a stall in the coversation with @jacques-n  if you would like to resurrect it please go ahead.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] jacques-n commented on a change in pull request #7214: ARROW-8842: [Java] fix ListVector's setValueCount to set inner vector's value count correctly

Posted by GitBox <gi...@apache.org>.
jacques-n commented on a change in pull request #7214:
URL: https://github.com/apache/arrow/pull/7214#discussion_r426751561



##########
File path: java/vector/src/main/java/org/apache/arrow/vector/complex/ListVector.java
##########
@@ -844,7 +844,7 @@ public void setValueCount(int valueCount) {
     }
     /* valueCount for the data vector is the current end offset */
     final int childValueCount = (valueCount == 0) ? 0 :
-            offsetBuffer.getInt((lastSet + 1) * OFFSET_WIDTH);
+            offsetBuffer.getInt(valueCount * OFFSET_WIDTH);

Review comment:
       Wait, scratch that. If you have one value, you need two offsets, right?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] pprudhvi commented on a change in pull request #7214: ARROW-8842: [Java] fix ListVector's setValueCount to set inner vector's value count correctly

Posted by GitBox <gi...@apache.org>.
pprudhvi commented on a change in pull request #7214:
URL: https://github.com/apache/arrow/pull/7214#discussion_r426753192



##########
File path: java/vector/src/main/java/org/apache/arrow/vector/complex/ListVector.java
##########
@@ -844,7 +844,7 @@ public void setValueCount(int valueCount) {
     }
     /* valueCount for the data vector is the current end offset */
     final int childValueCount = (valueCount == 0) ? 0 :
-            offsetBuffer.getInt((lastSet + 1) * OFFSET_WIDTH);
+            offsetBuffer.getInt(valueCount * OFFSET_WIDTH);

Review comment:
       yes, in that case valueCount will be 1 which gives the # elements in the first value = # values in data vector




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] jacques-n commented on a change in pull request #7214: ARROW-8842: [Java] fix ListVector's setValueCount to set inner vector's value count correctly

Posted by GitBox <gi...@apache.org>.
jacques-n commented on a change in pull request #7214:
URL: https://github.com/apache/arrow/pull/7214#discussion_r426751112



##########
File path: java/vector/src/main/java/org/apache/arrow/vector/complex/ListVector.java
##########
@@ -844,7 +844,7 @@ public void setValueCount(int valueCount) {
     }
     /* valueCount for the data vector is the current end offset */
     final int childValueCount = (valueCount == 0) ? 0 :
-            offsetBuffer.getInt((lastSet + 1) * OFFSET_WIDTH);
+            offsetBuffer.getInt(valueCount * OFFSET_WIDTH);

Review comment:
       yeah, good point :)




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] pprudhvi commented on a change in pull request #7214: ARROW-8842: [Java] fix ListVector's setValueCount to set inner vector's value count correctly

Posted by GitBox <gi...@apache.org>.
pprudhvi commented on a change in pull request #7214:
URL: https://github.com/apache/arrow/pull/7214#discussion_r426753192



##########
File path: java/vector/src/main/java/org/apache/arrow/vector/complex/ListVector.java
##########
@@ -844,7 +844,7 @@ public void setValueCount(int valueCount) {
     }
     /* valueCount for the data vector is the current end offset */
     final int childValueCount = (valueCount == 0) ? 0 :
-            offsetBuffer.getInt((lastSet + 1) * OFFSET_WIDTH);
+            offsetBuffer.getInt(valueCount * OFFSET_WIDTH);

Review comment:
       yes, in that case valueCount will be 1 which gives the length of the first value = # values in data vector




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] emkornfield commented on pull request #7214: ARROW-8842: [Java] fix ListVector's setValueCount to set inner vector's value count correctly

Posted by GitBox <gi...@apache.org>.
emkornfield commented on pull request #7214:
URL: https://github.com/apache/arrow/pull/7214#issuecomment-770454578


   Is this fix still necessary?


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org