You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2020/05/27 03:22:45 UTC

[GitHub] [arrow] liyafan82 opened a new pull request #6425: ARROW-6111: [Java] Support LargeVarChar and LargeBinary types

liyafan82 opened a new pull request #6425:
URL: https://github.com/apache/arrow/pull/6425


   These types of vectors are also variable width vectors. However, they have a offset width of 8, so the underlying data buffer can be over 2GB in size. 


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] emkornfield commented on pull request #6425: ARROW-6111: [Java] Support LargeVarChar and LargeBinary types

Posted by GitBox <gi...@apache.org>.
emkornfield commented on pull request #6425:
URL: https://github.com/apache/arrow/pull/6425#issuecomment-629421269


   @BryanCutler I believe the integration tests couple LargeList with LargeVarChar/LargeBinary, so both need to be implemented to enable the integration tests.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] BryanCutler edited a comment on pull request #6425: ARROW-6111: [Java] Support LargeVarChar and LargeBinary types

Posted by GitBox <gi...@apache.org>.
BryanCutler edited a comment on pull request #6425:
URL: https://github.com/apache/arrow/pull/6425#issuecomment-632887913


   Thanks @liyafan82 , looks like they didn't pass on this first try. Any idea what was causing the error?
   ```
   Error accessing files
   Current token (VALUE_STRING) not numeric, can not use numeric value accessors
    at [Source: (File); line: 65, column: 14]
   12:46:37.776 [main] ERROR org.apache.arrow.tools.Integration - Error accessing files
   com.fasterxml.jackson.core.JsonParseException: Current token (VALUE_STRING) not numeric, can not use numeric value accessors
    at [Source: (File); line: 65, column: 14]
   	at com.fasterxml.jackson.core.JsonParser._constructError(JsonParser.java:1804)
   	at com.fasterxml.jackson.core.base.ParserMinimalBase._reportError(ParserMinimalBase.java:698)
   	at com.fasterxml.jackson.core.base.ParserBase._parseNumericValue(ParserBase.java:781)
   	at com.fasterxml.jackson.core.base.ParserBase._parseIntValue(ParserBase.java:799)
   	at com.fasterxml.jackson.core.base.ParserBase.getIntValue(ParserBase.java:645)
   	at org.apache.arrow.vector.ipc.JsonFileReader$BufferHelper$5.read(JsonFileReader.java:306)
   ```


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] liyafan82 commented on pull request #6425: ARROW-6111: [Java] Support LargeVarChar and LargeBinary types

Posted by GitBox <gi...@apache.org>.
liyafan82 commented on pull request #6425:
URL: https://github.com/apache/arrow/pull/6425#issuecomment-621827317


   I have added integration test for the large varchar vector. 


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] liyafan82 commented on a change in pull request #6425: ARROW-6111: [Java] Support LargeVarChar and LargeBinary types

Posted by GitBox <gi...@apache.org>.
liyafan82 commented on a change in pull request #6425:
URL: https://github.com/apache/arrow/pull/6425#discussion_r431542563



##########
File path: java/vector/src/main/java/org/apache/arrow/vector/validate/ValidateVectorVisitor.java
##########
@@ -90,6 +91,11 @@ public Void visit(BaseVariableWidthVector vector, Void value) {
     return null;
   }
 
+  @Override
+  public Void visit(BaseLargeVariableWidthVector left, Void value) {
+    return null;

Review comment:
       Sure. I will fix this problem in ARROW-8402, and will open other JIRAs to track other TODOs. Thanks for your reminder. 




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] liyafan82 commented on pull request #6425: ARROW-6111: [Java] Support LargeVarChar and LargeBinary types

Posted by GitBox <gi...@apache.org>.
liyafan82 commented on pull request #6425:
URL: https://github.com/apache/arrow/pull/6425#issuecomment-624426890


   @nealrichardson and @BryanCutler Thanks for your good suggestion.
   I have removed the skip in `datagen.py`, but the tests failed. The reason is that we do not support `LargeList` type yet:
   
   ```
   2020-05-06T03:03:50.3032543Z com.fasterxml.jackson.databind.exc.InvalidTypeIdException: Could not resolve type id 'largelist' as a subtype of [simple type, class org.apache.arrow.vector.types.pojo.ArrowType]: known type ids = [binary, bool, date, decimal, duration, fixedsizebinary, fixedsizelist, floatingpoint, int, interval, largebinary, largeutf8, list, map, null, struct, time, timestamp, union, utf8] (for POJO property 'type')
   2020-05-06T03:03:50.3033308Z  at [Source: (File); line: 7, column: 19] (through reference chain: org.apache.arrow.vector.types.pojo.Schema["fields"]->java.util.ArrayList[0]->org.apache.arrow.vector.types.pojo.Field["type"])
   ```


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] BryanCutler commented on pull request #6425: ARROW-6111: [Java] Support LargeVarChar and LargeBinary types

Posted by GitBox <gi...@apache.org>.
BryanCutler commented on pull request #6425:
URL: https://github.com/apache/arrow/pull/6425#issuecomment-634963132


   merged to master, thanks @liyafan82 !


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] emkornfield commented on pull request #6425: ARROW-6111: [Java] Support LargeVarChar and LargeBinary types

Posted by GitBox <gi...@apache.org>.
emkornfield commented on pull request #6425:
URL: https://github.com/apache/arrow/pull/6425#issuecomment-629016911


   @BryanCutler @siddharthteotia I think I'm OK merging if this is you are happy with the code and following up on integation tests as part of ARROW-6110?  What do you two think?


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] nealrichardson commented on pull request #6425: ARROW-6111: [Java] Support LargeVarChar and LargeBinary types

Posted by GitBox <gi...@apache.org>.
nealrichardson commented on pull request #6425:
URL: https://github.com/apache/arrow/pull/6425#issuecomment-622458158


   If you're doing integration tests as part of this patch, please remove this skip: https://github.com/apache/arrow/blob/master/dev/archery/archery/integration/datagen.py#L1446


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] liyafan82 commented on pull request #6425: ARROW-6111: [Java] Support LargeVarChar and LargeBinary types

Posted by GitBox <gi...@apache.org>.
liyafan82 commented on pull request #6425:
URL: https://github.com/apache/arrow/pull/6425#issuecomment-633883684


   @BryanCutler The integration tests pass now. Please take another look when you have time. Thank you.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] emkornfield commented on pull request #6425: ARROW-6111: [Java] Support LargeVarChar and LargeBinary types

Posted by GitBox <gi...@apache.org>.
emkornfield commented on pull request #6425:
URL: https://github.com/apache/arrow/pull/6425#issuecomment-630448066


   @BryanCutler you are correct, for some reason I thought LargeList was coupled with LargeVarChar/LargeBinary.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] liyafan82 commented on pull request #6425: ARROW-6111: [Java] Support LargeVarChar and LargeBinary types

Posted by GitBox <gi...@apache.org>.
liyafan82 commented on pull request #6425:
URL: https://github.com/apache/arrow/pull/6425#issuecomment-632982040


   > Thanks @liyafan82 , looks like they didn't pass on this first try. Any idea what was causing the error?
   > 
   > ```
   > Error accessing files
   > Current token (VALUE_STRING) not numeric, can not use numeric value accessors
   >  at [Source: (File); line: 65, column: 14]
   > 12:46:37.776 [main] ERROR org.apache.arrow.tools.Integration - Error accessing files
   > com.fasterxml.jackson.core.JsonParseException: Current token (VALUE_STRING) not numeric, can not use numeric value accessors
   >  at [Source: (File); line: 65, column: 14]
   > 	at com.fasterxml.jackson.core.JsonParser._constructError(JsonParser.java:1804)
   > 	at com.fasterxml.jackson.core.base.ParserMinimalBase._reportError(ParserMinimalBase.java:698)
   > 	at com.fasterxml.jackson.core.base.ParserBase._parseNumericValue(ParserBase.java:781)
   > 	at com.fasterxml.jackson.core.base.ParserBase._parseIntValue(ParserBase.java:799)
   > 	at com.fasterxml.jackson.core.base.ParserBase.getIntValue(ParserBase.java:645)
   > 	at org.apache.arrow.vector.ipc.JsonFileReader$BufferHelper$5.read(JsonFileReader.java:306)
   > ```
   
   Thank you @BryanCutler . I am investigating. 


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] liyafan82 closed pull request #6425: ARROW-6111: [Java] Support LargeVarChar and LargeBinary types

Posted by GitBox <gi...@apache.org>.
liyafan82 closed pull request #6425:
URL: https://github.com/apache/arrow/pull/6425


   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] liyafan82 commented on pull request #6425: ARROW-6111: [Java] Support LargeVarChar and LargeBinary types

Posted by GitBox <gi...@apache.org>.
liyafan82 commented on pull request #6425:
URL: https://github.com/apache/arrow/pull/6425#issuecomment-634447500


   > Great job getting integration tests passing @liyafan82 ! I took a quick look and LGTM, will do a more thorough pass hopefully tomorrow before merging.
   
   @BryanCutler Please take your time. Thanks a lot for your effort. 


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] BryanCutler commented on pull request #6425: ARROW-6111: [Java] Support LargeVarChar and LargeBinary types

Posted by GitBox <gi...@apache.org>.
BryanCutler commented on pull request #6425:
URL: https://github.com/apache/arrow/pull/6425#issuecomment-622555307


   Yes, please enable integration tests and lets make sure it passes before merging this.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] BryanCutler commented on pull request #6425: ARROW-6111: [Java] Support LargeVarChar and LargeBinary types

Posted by GitBox <gi...@apache.org>.
BryanCutler commented on pull request #6425:
URL: https://github.com/apache/arrow/pull/6425#issuecomment-630444330


   The `generate_primitive_large_offsets_case` looks like it is just testing 'largebinary', 'largeutf8'. Are large lists somehow part of that?
   
   ```python
   def generate_primitive_large_offsets_case(batch_sizes):
       types = ['largebinary', 'largeutf8']
   
       fields = []
   
       for type_ in types:
           fields.append(get_field(type_ + "_nullable", type_, nullable=True))
           fields.append(get_field(type_ + "_nonnullable", type_, nullable=False))
   
       return _generate_file('primitive_large_offsets', fields, batch_sizes
   ```


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] liyafan82 commented on pull request #6425: ARROW-6111: [Java] Support LargeVarChar and LargeBinary types

Posted by GitBox <gi...@apache.org>.
liyafan82 commented on pull request #6425:
URL: https://github.com/apache/arrow/pull/6425#issuecomment-635050822


   @BryanCutler @siddharthteotia @emkornfield I am aware that this PR is large, and consumes lots of effort. So thanks a lot for your effort and good comments!


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] BryanCutler commented on pull request #6425: ARROW-6111: [Java] Support LargeVarChar and LargeBinary types

Posted by GitBox <gi...@apache.org>.
BryanCutler commented on pull request #6425:
URL: https://github.com/apache/arrow/pull/6425#issuecomment-632887913


   Thanks @liyafan82 , looks like they didn't pass on this first try. Any idea what was causing the error:
   ```
   Error accessing files
   Current token (VALUE_STRING) not numeric, can not use numeric value accessors
    at [Source: (File); line: 65, column: 14]
   12:46:37.776 [main] ERROR org.apache.arrow.tools.Integration - Error accessing files
   com.fasterxml.jackson.core.JsonParseException: Current token (VALUE_STRING) not numeric, can not use numeric value accessors
    at [Source: (File); line: 65, column: 14]
   	at com.fasterxml.jackson.core.JsonParser._constructError(JsonParser.java:1804)
   	at com.fasterxml.jackson.core.base.ParserMinimalBase._reportError(ParserMinimalBase.java:698)
   	at com.fasterxml.jackson.core.base.ParserBase._parseNumericValue(ParserBase.java:781)
   	at com.fasterxml.jackson.core.base.ParserBase._parseIntValue(ParserBase.java:799)
   	at com.fasterxml.jackson.core.base.ParserBase.getIntValue(ParserBase.java:645)
   	at org.apache.arrow.vector.ipc.JsonFileReader$BufferHelper$5.read(JsonFileReader.java:306)
   ```


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] liyafan82 edited a comment on pull request #6425: ARROW-6111: [Java] Support LargeVarChar and LargeBinary types

Posted by GitBox <gi...@apache.org>.
liyafan82 edited a comment on pull request #6425:
URL: https://github.com/apache/arrow/pull/6425#issuecomment-624426890


   @nealrichardson and @BryanCutler Thanks for your good suggestion.
   I have removed the skip in `datagen.py`, but the tests failed. The reason is that we do not support `LargeList` type yet:
   
   ```
   2020-05-06T03:03:50.3032543Z com.fasterxml.jackson.databind.exc.InvalidTypeIdException: Could not resolve type id 'largelist' as a subtype of [simple type, class org.apache.arrow.vector.types.pojo.ArrowType]: known type ids = [binary, bool, date, decimal, duration, fixedsizebinary, fixedsizelist, floatingpoint, int, interval, largebinary, largeutf8, list, map, null, struct, time, timestamp, union, utf8] (for POJO property 'type')
   2020-05-06T03:03:50.3033308Z  at [Source: (File); line: 7, column: 19] (through reference chain: org.apache.arrow.vector.types.pojo.Schema["fields"]->java.util.ArrayList[0]->org.apache.arrow.vector.types.pojo.Field["type"])
   ```
   
   I have revised the comment in `datagen.py` accordingly. 


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] BryanCutler commented on a change in pull request #6425: ARROW-6111: [Java] Support LargeVarChar and LargeBinary types

Posted by GitBox <gi...@apache.org>.
BryanCutler commented on a change in pull request #6425:
URL: https://github.com/apache/arrow/pull/6425#discussion_r431461407



##########
File path: java/vector/src/main/java/org/apache/arrow/vector/validate/ValidateVectorVisitor.java
##########
@@ -90,6 +91,11 @@ public Void visit(BaseVariableWidthVector vector, Void value) {
     return null;
   }
 
+  @Override
+  public Void visit(BaseLargeVariableWidthVector left, Void value) {
+    return null;

Review comment:
       is this another TODO? Would you mind making JIRAs for this and the other TODOs here so they can be tracked?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] BryanCutler commented on pull request #6425: ARROW-6111: [Java] Support LargeVarChar and LargeBinary types

Posted by GitBox <gi...@apache.org>.
BryanCutler commented on pull request #6425:
URL: https://github.com/apache/arrow/pull/6425#issuecomment-629419409


   I think you should be removing the skip Java here https://github.com/apache/arrow/blob/2f72713446b04f8979b04f907e7185985028b0a8/dev/archery/archery/integration/datagen.py#L1480 to enable integration testing for this.  I'll try to take another review pass early next week.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] liyafan82 commented on pull request #6425: ARROW-6111: [Java] Support LargeVarChar and LargeBinary types

Posted by GitBox <gi...@apache.org>.
liyafan82 commented on pull request #6425:
URL: https://github.com/apache/arrow/pull/6425#issuecomment-632603791


   @BryanCutler @emkornfield Sorry for my late response. 
   I have removed the skip. Let's see if the integration tests can pass this time. 


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] BryanCutler closed pull request #6425: ARROW-6111: [Java] Support LargeVarChar and LargeBinary types

Posted by GitBox <gi...@apache.org>.
BryanCutler closed pull request #6425:
URL: https://github.com/apache/arrow/pull/6425


   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org