You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2020/10/06 15:37:40 UTC

[GitHub] [arrow] HedgehogCode opened a new pull request #8363: ARROW-10174: [Java] Fix reading/writing dict structs

HedgehogCode opened a new pull request #8363:
URL: https://github.com/apache/arrow/pull/8363


   When translating between the memory FieldType and message FieldType for
   dictionary encoded vectors the children of the dictionary field were not
   handled correctly.
   * When going from memory format to message format the Field must have the
     children of the dictionary field.
   * When going from message format to memory format the Field must have no
     children but the dictionary must have the mapped children


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] liyafan82 commented on a change in pull request #8363: ARROW-10174: [Java] Fix reading/writing dict structs

Posted by GitBox <gi...@apache.org>.
liyafan82 commented on a change in pull request #8363:
URL: https://github.com/apache/arrow/pull/8363#discussion_r502761624



##########
File path: java/vector/src/test/java/org/apache/arrow/vector/testing/ValueVectorDataPopulator.java
##########
@@ -673,4 +677,32 @@ public static void setVector(FixedSizeListVector vector, List<Integer>... values
     dataVector.setValueCount(curPos);
     vector.setValueCount(values.length);
   }
+
+  /**
+   * Populate values for {@link StructVector}.
+   */
+  public static void setVector(StructVector vector, Map<String, List<Integer>> values) {
+    vector.allocateNewSafe();
+
+    int valueCount = 0;
+    for (final Entry<String, List<Integer>> entry : values.entrySet()) {
+      // Add the child
+      final IntVector child = vector.addOrGet(entry.getKey(),
+          FieldType.nullable(MinorType.INT.getType()), IntVector.class);
+
+      // Write the values to the child
+      child.allocateNew();
+      final List<Integer> v = entry.getValue();
+      for (int i = 0; i < v.size(); i++) {
+        if (v.get(i) != null) {
+          child.set(i, v.get(i));
+        } else {
+          child.setNull(i);
+        }
+        vector.setIndexDefined(i);

Review comment:
       we should call `setIndexDefined` only if `v.get(i) != null`?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] liyafan82 commented on a change in pull request #8363: ARROW-10174: [Java] Fix reading/writing dict structs

Posted by GitBox <gi...@apache.org>.
liyafan82 commented on a change in pull request #8363:
URL: https://github.com/apache/arrow/pull/8363#discussion_r502373894



##########
File path: java/vector/src/test/java/org/apache/arrow/vector/testing/ValueVectorDataPopulator.java
##########
@@ -673,4 +677,28 @@ public static void setVector(FixedSizeListVector vector, List<Integer>... values
     dataVector.setValueCount(curPos);
     vector.setValueCount(values.length);
   }
+
+  /**
+   * Populate values for {@link StructVector}.
+   */
+  public static void setVector(StructVector vector, Map<String, List<Integer>> values) {
+    vector.allocateNewSafe();
+
+    int valueCount = 0;
+    for (final Entry<String, List<Integer>> entry : values.entrySet()) {
+      // Add the child
+      final IntVector child = vector.addOrGet(entry.getKey(),
+          FieldType.nullable(MinorType.INT.getType()), IntVector.class);
+
+      // Write the values to the child
+      child.allocateNew();
+      final List<Integer> v = entry.getValue();
+      for (int i = 0; i < v.size(); i++) {
+        child.set(i, v.get(i));
+        vector.setIndexDefined(i);
+      }

Review comment:
       Maybe we also need to set the value count for the child vector




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] liyafan82 commented on a change in pull request #8363: ARROW-10174: [Java] Fix reading/writing dict structs

Posted by GitBox <gi...@apache.org>.
liyafan82 commented on a change in pull request #8363:
URL: https://github.com/apache/arrow/pull/8363#discussion_r502761722



##########
File path: java/vector/src/test/java/org/apache/arrow/vector/ipc/TestArrowReaderWriter.java
##########
@@ -305,6 +325,77 @@ public void testWriteReadWithDictionaries() throws IOException {
     }
   }
 
+  @Test
+  public void testWriteReadWithStructDictionaries() throws IOException {
+    DictionaryProvider.MapDictionaryProvider provider = new DictionaryProvider.MapDictionaryProvider();
+    provider.put(dictionary4);
+
+    final StructVector vector = newVector(StructVector.class, "D4", MinorType.STRUCT, allocator);
+    final Map<String, List<Integer>> values = new HashMap<>();
+    // Index:                     0, 2, 1, 2, 1, 0, 0
+    values.put("a", Arrays.asList(1, 3, 2, 3, 2, 1, 1));
+    values.put("b", Arrays.asList(4, 6, 5, 6, 5, 4, 4));
+    setVector(vector, values);
+    FieldVector encodedVector = (FieldVector) DictionaryEncoder.encode(vector, dictionary4);
+
+    List<Field> fields = Arrays.asList(encodedVector.getField());
+    List<FieldVector> vectors = Collections2.asImmutableList(encodedVector);
+    try (VectorSchemaRoot root = new VectorSchemaRoot(fields, vectors, encodedVector.getValueCount());
+         ByteArrayOutputStream out = new ByteArrayOutputStream();
+         ArrowFileWriter writer = new ArrowFileWriter(root, provider, newChannel(out));) {
+
+      writer.start();
+      writer.writeBatch();
+      writer.end();
+
+      try (SeekableReadChannel channel = new SeekableReadChannel(
+          new ByteArrayReadableSeekableByteChannel(out.toByteArray()));
+          ArrowFileReader reader = new ArrowFileReader(channel, allocator)) {
+        final VectorSchemaRoot readRoot = reader.getVectorSchemaRoot();
+        final Schema readSchema = readRoot.getSchema();
+        assertEquals(root.getSchema(), readSchema);
+        assertEquals(1, reader.getDictionaryBlocks().size());
+        assertEquals(1, reader.getRecordBlocks().size());
+
+        reader.loadNextBatch();
+        assertEquals(1, readRoot.getFieldVectors().size());
+        assertEquals(1, reader.getDictionaryVectors().size());
+        
+        // Read the encoded vector and check it
+        final FieldVector readEncoded = readRoot.getVector(0);
+        assertEquals(encodedVector.getValueCount(), readEncoded.getValueCount());
+        assertTrue(new RangeEqualsVisitor(encodedVector, readEncoded)
+            .rangeEquals(new Range(0, 0, encodedVector.getValueCount())));
+
+        // Read the dictionary
+        final Map<Long, Dictionary> readDictionaryMap = reader.getDictionaryVectors();
+        final Dictionary readDictionary =
+            readDictionaryMap.get(readEncoded.getField().getDictionary().getId());
+        assertNotNull(readDictionary);
+
+        // Assert the dictionary vector is correct
+        final FieldVector readDictionaryVector = readDictionary.getVector();
+        assertEquals(dictionaryVector4.getValueCount(), readDictionaryVector.getValueCount());
+        final BiFunction<ValueVector, ValueVector, Boolean> typeComparatorIgnoreName =
+            (v1, v2) -> new TypeEqualsVisitor(v1, false, true).equals(v2);
+        assertTrue("Dictionary vectors are not equal",
+            new RangeEqualsVisitor(dictionaryVector4, readDictionaryVector,
+                typeComparatorIgnoreName)
+                    .rangeEquals(new Range(0, 0, dictionaryVector4.getValueCount())));
+
+        // Assert the decoded vector is correct
+        final ValueVector readVector = DictionaryEncoder.decode(readEncoded, readDictionary);
+        assertEquals(vector.getValueCount(), readVector.getValueCount());
+        assertTrue("Decoded vectors are not equal",
+            new RangeEqualsVisitor(vector, readVector, typeComparatorIgnoreName)
+                .rangeEquals(new Range(0, 0, vector.getValueCount())));
+
+        vector.close();
+        readVector.close();

Review comment:
       it would be beneficial to close them through a try-with-resource clause to avoid resource leak?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] liyafan82 commented on a change in pull request #8363: ARROW-10174: [Java] Fix reading/writing dict structs

Posted by GitBox <gi...@apache.org>.
liyafan82 commented on a change in pull request #8363:
URL: https://github.com/apache/arrow/pull/8363#discussion_r502369246



##########
File path: java/vector/src/main/java/org/apache/arrow/vector/util/DictionaryUtility.java
##########
@@ -115,25 +118,28 @@ public static Field toMemoryFormat(Field field, BufferAllocator allocator, Map<L
     }
 
     ArrowType type;
+    List<Field> fieldChildren;
     if (encoding == null) {
       type = field.getType();
+      fieldChildren = updatedChildren;
     } else {
       // re-type the field for in-memory format
       type = encoding.getIndexType();
+      fieldChildren = null;

Review comment:
       This seems not necessary?

##########
File path: java/vector/src/test/java/org/apache/arrow/vector/ipc/TestArrowReaderWriter.java
##########
@@ -305,6 +321,64 @@ public void testWriteReadWithDictionaries() throws IOException {
     }
   }
 
+  @Test
+  public void testWriteReadWithStructDictionaries() throws IOException {
+    DictionaryProvider.MapDictionaryProvider provider = new DictionaryProvider.MapDictionaryProvider();
+    provider.put(dictionary4);
+
+    final StructVector vector = newVector(StructVector.class, "D4", MinorType.STRUCT, allocator);
+    final Map<String, List<Integer>> values = new HashMap<>();
+    // Index:                     0, 2, 1, 2, 1, 0, 0
+    values.put("a", Arrays.asList(1, 3, 2, 3, 2, 1, 1));
+    values.put("b", Arrays.asList(4, 6, 5, 6, 5, 4, 4));
+    setVector(vector, values);
+    FieldVector encodedVector = (FieldVector) DictionaryEncoder.encode(vector, dictionary4);

Review comment:
       Maybe we need to verify the encoded vector?

##########
File path: java/vector/src/test/java/org/apache/arrow/vector/ipc/TestArrowReaderWriter.java
##########
@@ -305,6 +321,64 @@ public void testWriteReadWithDictionaries() throws IOException {
     }
   }
 
+  @Test
+  public void testWriteReadWithStructDictionaries() throws IOException {
+    DictionaryProvider.MapDictionaryProvider provider = new DictionaryProvider.MapDictionaryProvider();
+    provider.put(dictionary4);
+
+    final StructVector vector = newVector(StructVector.class, "D4", MinorType.STRUCT, allocator);
+    final Map<String, List<Integer>> values = new HashMap<>();
+    // Index:                     0, 2, 1, 2, 1, 0, 0
+    values.put("a", Arrays.asList(1, 3, 2, 3, 2, 1, 1));
+    values.put("b", Arrays.asList(4, 6, 5, 6, 5, 4, 4));
+    setVector(vector, values);
+    FieldVector encodedVector = (FieldVector) DictionaryEncoder.encode(vector, dictionary4);
+
+    List<Field> fields = Arrays.asList(encodedVector.getField());
+    List<FieldVector> vectors = Collections2.asImmutableList(encodedVector);
+    try (VectorSchemaRoot root = new VectorSchemaRoot(fields, vectors, encodedVector.getValueCount());
+         ByteArrayOutputStream out = new ByteArrayOutputStream();
+         ArrowFileWriter writer = new ArrowFileWriter(root, provider, newChannel(out));) {
+
+      writer.start();
+      writer.writeBatch();
+      writer.end();
+
+      try (SeekableReadChannel channel = new SeekableReadChannel(
+          new ByteArrayReadableSeekableByteChannel(out.toByteArray()));
+          ArrowFileReader reader = new ArrowFileReader(channel, allocator)) {
+        final VectorSchemaRoot readRoot = reader.getVectorSchemaRoot();
+        final Schema readSchema = readRoot.getSchema();
+        assertEquals(root.getSchema(), readSchema);
+        assertEquals(1, reader.getDictionaryBlocks().size());
+        assertEquals(1, reader.getRecordBlocks().size());
+
+        reader.loadNextBatch();
+        assertEquals(1, readRoot.getFieldVectors().size());
+        assertEquals(1, reader.getDictionaryVectors().size());
+        
+        final FieldVector readEncoded = readRoot.getVector(0);
+        assertTrue(readEncoded instanceof IntVector);

Review comment:
       should we validate `readEncoded`?

##########
File path: java/vector/src/test/java/org/apache/arrow/vector/ipc/TestArrowReaderWriter.java
##########
@@ -305,6 +321,64 @@ public void testWriteReadWithDictionaries() throws IOException {
     }
   }
 
+  @Test
+  public void testWriteReadWithStructDictionaries() throws IOException {
+    DictionaryProvider.MapDictionaryProvider provider = new DictionaryProvider.MapDictionaryProvider();
+    provider.put(dictionary4);
+
+    final StructVector vector = newVector(StructVector.class, "D4", MinorType.STRUCT, allocator);
+    final Map<String, List<Integer>> values = new HashMap<>();
+    // Index:                     0, 2, 1, 2, 1, 0, 0
+    values.put("a", Arrays.asList(1, 3, 2, 3, 2, 1, 1));
+    values.put("b", Arrays.asList(4, 6, 5, 6, 5, 4, 4));
+    setVector(vector, values);
+    FieldVector encodedVector = (FieldVector) DictionaryEncoder.encode(vector, dictionary4);
+
+    List<Field> fields = Arrays.asList(encodedVector.getField());
+    List<FieldVector> vectors = Collections2.asImmutableList(encodedVector);
+    try (VectorSchemaRoot root = new VectorSchemaRoot(fields, vectors, encodedVector.getValueCount());
+         ByteArrayOutputStream out = new ByteArrayOutputStream();
+         ArrowFileWriter writer = new ArrowFileWriter(root, provider, newChannel(out));) {
+
+      writer.start();
+      writer.writeBatch();
+      writer.end();
+
+      try (SeekableReadChannel channel = new SeekableReadChannel(
+          new ByteArrayReadableSeekableByteChannel(out.toByteArray()));
+          ArrowFileReader reader = new ArrowFileReader(channel, allocator)) {
+        final VectorSchemaRoot readRoot = reader.getVectorSchemaRoot();
+        final Schema readSchema = readRoot.getSchema();
+        assertEquals(root.getSchema(), readSchema);
+        assertEquals(1, reader.getDictionaryBlocks().size());
+        assertEquals(1, reader.getRecordBlocks().size());
+
+        reader.loadNextBatch();
+        assertEquals(1, readRoot.getFieldVectors().size());
+        assertEquals(1, reader.getDictionaryVectors().size());
+        
+        final FieldVector readEncoded = readRoot.getVector(0);
+        assertTrue(readEncoded instanceof IntVector);
+        assertEquals(new FieldType(true, MinorType.INT.getType(), dictionary4.getEncoding()),
+            readEncoded.getField().getFieldType());
+
+        final Map<Long, Dictionary> readDictionaryMap = reader.getDictionaryVectors();
+        final Dictionary readDictionary =
+            readDictionaryMap.get(readEncoded.getField().getDictionary().getId());
+        assertNotNull(readDictionary);

Review comment:
       should we validate the read dictionary vector?

##########
File path: java/vector/src/test/java/org/apache/arrow/vector/ipc/TestArrowReaderWriter.java
##########
@@ -305,6 +321,64 @@ public void testWriteReadWithDictionaries() throws IOException {
     }
   }
 
+  @Test
+  public void testWriteReadWithStructDictionaries() throws IOException {
+    DictionaryProvider.MapDictionaryProvider provider = new DictionaryProvider.MapDictionaryProvider();
+    provider.put(dictionary4);
+
+    final StructVector vector = newVector(StructVector.class, "D4", MinorType.STRUCT, allocator);
+    final Map<String, List<Integer>> values = new HashMap<>();
+    // Index:                     0, 2, 1, 2, 1, 0, 0
+    values.put("a", Arrays.asList(1, 3, 2, 3, 2, 1, 1));
+    values.put("b", Arrays.asList(4, 6, 5, 6, 5, 4, 4));
+    setVector(vector, values);
+    FieldVector encodedVector = (FieldVector) DictionaryEncoder.encode(vector, dictionary4);
+
+    List<Field> fields = Arrays.asList(encodedVector.getField());
+    List<FieldVector> vectors = Collections2.asImmutableList(encodedVector);
+    try (VectorSchemaRoot root = new VectorSchemaRoot(fields, vectors, encodedVector.getValueCount());
+         ByteArrayOutputStream out = new ByteArrayOutputStream();
+         ArrowFileWriter writer = new ArrowFileWriter(root, provider, newChannel(out));) {
+
+      writer.start();
+      writer.writeBatch();
+      writer.end();
+
+      try (SeekableReadChannel channel = new SeekableReadChannel(
+          new ByteArrayReadableSeekableByteChannel(out.toByteArray()));
+          ArrowFileReader reader = new ArrowFileReader(channel, allocator)) {
+        final VectorSchemaRoot readRoot = reader.getVectorSchemaRoot();
+        final Schema readSchema = readRoot.getSchema();
+        assertEquals(root.getSchema(), readSchema);
+        assertEquals(1, reader.getDictionaryBlocks().size());
+        assertEquals(1, reader.getRecordBlocks().size());
+
+        reader.loadNextBatch();
+        assertEquals(1, readRoot.getFieldVectors().size());
+        assertEquals(1, reader.getDictionaryVectors().size());
+        
+        final FieldVector readEncoded = readRoot.getVector(0);
+        assertTrue(readEncoded instanceof IntVector);
+        assertEquals(new FieldType(true, MinorType.INT.getType(), dictionary4.getEncoding()),
+            readEncoded.getField().getFieldType());
+
+        final Map<Long, Dictionary> readDictionaryMap = reader.getDictionaryVectors();
+        final Dictionary readDictionary =
+            readDictionaryMap.get(readEncoded.getField().getDictionary().getId());
+        assertNotNull(readDictionary);
+
+        final ValueVector readVector = DictionaryEncoder.decode(readEncoded, readDictionary);
+        assertTrue(readVector instanceof StructVector);
+        assertEquals(vector.getValueCount(), readVector.getValueCount());
+        for (int i = 0; i < vector.getValueCount(); i++) {
+          assertEquals(vector.getObject(i), readVector.getObject(i));
+        }

Review comment:
       Maybe we can compare the two vectors through a `RangeEqualsVisitor` object?

##########
File path: java/vector/src/test/java/org/apache/arrow/vector/testing/ValueVectorDataPopulator.java
##########
@@ -673,4 +677,28 @@ public static void setVector(FixedSizeListVector vector, List<Integer>... values
     dataVector.setValueCount(curPos);
     vector.setValueCount(values.length);
   }
+
+  /**
+   * Populate values for {@link StructVector}.
+   */
+  public static void setVector(StructVector vector, Map<String, List<Integer>> values) {
+    vector.allocateNewSafe();
+
+    int valueCount = 0;
+    for (final Entry<String, List<Integer>> entry : values.entrySet()) {
+      // Add the child
+      final IntVector child = vector.addOrGet(entry.getKey(),
+          FieldType.nullable(MinorType.INT.getType()), IntVector.class);
+
+      // Write the values to the child
+      child.allocateNew();
+      final List<Integer> v = entry.getValue();
+      for (int i = 0; i < v.size(); i++) {
+        child.set(i, v.get(i));

Review comment:
       We need to consider the null values in v?

##########
File path: java/vector/src/test/java/org/apache/arrow/vector/testing/ValueVectorDataPopulator.java
##########
@@ -673,4 +677,28 @@ public static void setVector(FixedSizeListVector vector, List<Integer>... values
     dataVector.setValueCount(curPos);
     vector.setValueCount(values.length);
   }
+
+  /**
+   * Populate values for {@link StructVector}.
+   */
+  public static void setVector(StructVector vector, Map<String, List<Integer>> values) {
+    vector.allocateNewSafe();
+
+    int valueCount = 0;
+    for (final Entry<String, List<Integer>> entry : values.entrySet()) {
+      // Add the child
+      final IntVector child = vector.addOrGet(entry.getKey(),
+          FieldType.nullable(MinorType.INT.getType()), IntVector.class);
+
+      // Write the values to the child
+      child.allocateNew();
+      final List<Integer> v = entry.getValue();
+      for (int i = 0; i < v.size(); i++) {
+        child.set(i, v.get(i));
+        vector.setIndexDefined(i);
+      }

Review comment:
       Maybe we also need to set the value count for the child vector




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] liyafan82 commented on a change in pull request #8363: ARROW-10174: [Java] Fix reading/writing dict structs

Posted by GitBox <gi...@apache.org>.
liyafan82 commented on a change in pull request #8363:
URL: https://github.com/apache/arrow/pull/8363#discussion_r502370575



##########
File path: java/vector/src/test/java/org/apache/arrow/vector/ipc/TestArrowReaderWriter.java
##########
@@ -305,6 +321,64 @@ public void testWriteReadWithDictionaries() throws IOException {
     }
   }
 
+  @Test
+  public void testWriteReadWithStructDictionaries() throws IOException {
+    DictionaryProvider.MapDictionaryProvider provider = new DictionaryProvider.MapDictionaryProvider();
+    provider.put(dictionary4);
+
+    final StructVector vector = newVector(StructVector.class, "D4", MinorType.STRUCT, allocator);
+    final Map<String, List<Integer>> values = new HashMap<>();
+    // Index:                     0, 2, 1, 2, 1, 0, 0
+    values.put("a", Arrays.asList(1, 3, 2, 3, 2, 1, 1));
+    values.put("b", Arrays.asList(4, 6, 5, 6, 5, 4, 4));
+    setVector(vector, values);
+    FieldVector encodedVector = (FieldVector) DictionaryEncoder.encode(vector, dictionary4);

Review comment:
       Maybe we need to verify the encoded vector?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] liyafan82 commented on pull request #8363: ARROW-10174: [Java] Fix reading/writing dict structs

Posted by GitBox <gi...@apache.org>.
liyafan82 commented on pull request #8363:
URL: https://github.com/apache/arrow/pull/8363#issuecomment-705928888


   > @liyafan82 do you have time to review?
   
   @emkornfield Sure. I will take a look in one or two days. 


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] emkornfield commented on pull request #8363: ARROW-10174: [Java] Fix reading/writing dict structs

Posted by GitBox <gi...@apache.org>.
emkornfield commented on pull request #8363:
URL: https://github.com/apache/arrow/pull/8363#issuecomment-704767430


   @liyafan82 do you have time to review?


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] liyafan82 commented on pull request #8363: ARROW-10174: [Java] Fix reading/writing dict structs

Posted by GitBox <gi...@apache.org>.
liyafan82 commented on pull request #8363:
URL: https://github.com/apache/arrow/pull/8363#issuecomment-709690383


   Merging. Thanks for your effort. @HedgehogCode 


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] HedgehogCode commented on a change in pull request #8363: ARROW-10174: [Java] Fix reading/writing dict structs

Posted by GitBox <gi...@apache.org>.
HedgehogCode commented on a change in pull request #8363:
URL: https://github.com/apache/arrow/pull/8363#discussion_r503126829



##########
File path: java/vector/src/test/java/org/apache/arrow/vector/testing/ValueVectorDataPopulator.java
##########
@@ -673,4 +677,32 @@ public static void setVector(FixedSizeListVector vector, List<Integer>... values
     dataVector.setValueCount(curPos);
     vector.setValueCount(values.length);
   }
+
+  /**
+   * Populate values for {@link StructVector}.
+   */
+  public static void setVector(StructVector vector, Map<String, List<Integer>> values) {
+    vector.allocateNewSafe();
+
+    int valueCount = 0;
+    for (final Entry<String, List<Integer>> entry : values.entrySet()) {
+      // Add the child
+      final IntVector child = vector.addOrGet(entry.getKey(),
+          FieldType.nullable(MinorType.INT.getType()), IntVector.class);
+
+      // Write the values to the child
+      child.allocateNew();
+      final List<Integer> v = entry.getValue();
+      for (int i = 0; i < v.size(); i++) {
+        if (v.get(i) != null) {
+          child.set(i, v.get(i));
+        } else {
+          child.setNull(i);
+        }
+        vector.setIndexDefined(i);

Review comment:
       I am not sure.
   The current implementation does not allow setting any struct element to `null`. If we only call `#setIndexDefined` if the child element is not `null` the implementation would not allow (non null) structs with all child values beeing `null`. I am okay with both but I think we should keep the API simple because this method is just there to make tests shorter.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] liyafan82 commented on a change in pull request #8363: ARROW-10174: [Java] Fix reading/writing dict structs

Posted by GitBox <gi...@apache.org>.
liyafan82 commented on a change in pull request #8363:
URL: https://github.com/apache/arrow/pull/8363#discussion_r502369246



##########
File path: java/vector/src/main/java/org/apache/arrow/vector/util/DictionaryUtility.java
##########
@@ -115,25 +118,28 @@ public static Field toMemoryFormat(Field field, BufferAllocator allocator, Map<L
     }
 
     ArrowType type;
+    List<Field> fieldChildren;
     if (encoding == null) {
       type = field.getType();
+      fieldChildren = updatedChildren;
     } else {
       // re-type the field for in-memory format
       type = encoding.getIndexType();
+      fieldChildren = null;

Review comment:
       This seems not necessary?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] liyafan82 commented on a change in pull request #8363: ARROW-10174: [Java] Fix reading/writing dict structs

Posted by GitBox <gi...@apache.org>.
liyafan82 commented on a change in pull request #8363:
URL: https://github.com/apache/arrow/pull/8363#discussion_r503176421



##########
File path: java/vector/src/test/java/org/apache/arrow/vector/ipc/TestArrowReaderWriter.java
##########
@@ -305,6 +325,77 @@ public void testWriteReadWithDictionaries() throws IOException {
     }
   }
 
+  @Test
+  public void testWriteReadWithStructDictionaries() throws IOException {
+    DictionaryProvider.MapDictionaryProvider provider = new DictionaryProvider.MapDictionaryProvider();
+    provider.put(dictionary4);
+
+    final StructVector vector = newVector(StructVector.class, "D4", MinorType.STRUCT, allocator);
+    final Map<String, List<Integer>> values = new HashMap<>();
+    // Index:                     0, 2, 1, 2, 1, 0, 0
+    values.put("a", Arrays.asList(1, 3, 2, 3, 2, 1, 1));
+    values.put("b", Arrays.asList(4, 6, 5, 6, 5, 4, 4));
+    setVector(vector, values);
+    FieldVector encodedVector = (FieldVector) DictionaryEncoder.encode(vector, dictionary4);
+
+    List<Field> fields = Arrays.asList(encodedVector.getField());
+    List<FieldVector> vectors = Collections2.asImmutableList(encodedVector);
+    try (VectorSchemaRoot root = new VectorSchemaRoot(fields, vectors, encodedVector.getValueCount());
+         ByteArrayOutputStream out = new ByteArrayOutputStream();
+         ArrowFileWriter writer = new ArrowFileWriter(root, provider, newChannel(out));) {
+
+      writer.start();
+      writer.writeBatch();
+      writer.end();
+
+      try (SeekableReadChannel channel = new SeekableReadChannel(
+          new ByteArrayReadableSeekableByteChannel(out.toByteArray()));
+          ArrowFileReader reader = new ArrowFileReader(channel, allocator)) {
+        final VectorSchemaRoot readRoot = reader.getVectorSchemaRoot();
+        final Schema readSchema = readRoot.getSchema();
+        assertEquals(root.getSchema(), readSchema);
+        assertEquals(1, reader.getDictionaryBlocks().size());
+        assertEquals(1, reader.getRecordBlocks().size());
+
+        reader.loadNextBatch();
+        assertEquals(1, readRoot.getFieldVectors().size());
+        assertEquals(1, reader.getDictionaryVectors().size());
+        
+        // Read the encoded vector and check it
+        final FieldVector readEncoded = readRoot.getVector(0);
+        assertEquals(encodedVector.getValueCount(), readEncoded.getValueCount());
+        assertTrue(new RangeEqualsVisitor(encodedVector, readEncoded)
+            .rangeEquals(new Range(0, 0, encodedVector.getValueCount())));
+
+        // Read the dictionary
+        final Map<Long, Dictionary> readDictionaryMap = reader.getDictionaryVectors();
+        final Dictionary readDictionary =
+            readDictionaryMap.get(readEncoded.getField().getDictionary().getId());
+        assertNotNull(readDictionary);
+
+        // Assert the dictionary vector is correct
+        final FieldVector readDictionaryVector = readDictionary.getVector();
+        assertEquals(dictionaryVector4.getValueCount(), readDictionaryVector.getValueCount());
+        final BiFunction<ValueVector, ValueVector, Boolean> typeComparatorIgnoreName =
+            (v1, v2) -> new TypeEqualsVisitor(v1, false, true).equals(v2);
+        assertTrue("Dictionary vectors are not equal",
+            new RangeEqualsVisitor(dictionaryVector4, readDictionaryVector,
+                typeComparatorIgnoreName)
+                    .rangeEquals(new Range(0, 0, dictionaryVector4.getValueCount())));
+
+        // Assert the decoded vector is correct
+        final ValueVector readVector = DictionaryEncoder.decode(readEncoded, readDictionary);
+        assertEquals(vector.getValueCount(), readVector.getValueCount());
+        assertTrue("Decoded vectors are not equal",
+            new RangeEqualsVisitor(vector, readVector, typeComparatorIgnoreName)
+                .rangeEquals(new Range(0, 0, vector.getValueCount())));
+
+        vector.close();
+        readVector.close();

Review comment:
       I see. Deep nesting is not good-looking. 




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] HedgehogCode commented on a change in pull request #8363: ARROW-10174: [Java] Fix reading/writing dict structs

Posted by GitBox <gi...@apache.org>.
HedgehogCode commented on a change in pull request #8363:
URL: https://github.com/apache/arrow/pull/8363#discussion_r503140409



##########
File path: java/vector/src/test/java/org/apache/arrow/vector/ipc/TestArrowReaderWriter.java
##########
@@ -305,6 +325,77 @@ public void testWriteReadWithDictionaries() throws IOException {
     }
   }
 
+  @Test
+  public void testWriteReadWithStructDictionaries() throws IOException {
+    DictionaryProvider.MapDictionaryProvider provider = new DictionaryProvider.MapDictionaryProvider();
+    provider.put(dictionary4);
+
+    final StructVector vector = newVector(StructVector.class, "D4", MinorType.STRUCT, allocator);
+    final Map<String, List<Integer>> values = new HashMap<>();
+    // Index:                     0, 2, 1, 2, 1, 0, 0
+    values.put("a", Arrays.asList(1, 3, 2, 3, 2, 1, 1));
+    values.put("b", Arrays.asList(4, 6, 5, 6, 5, 4, 4));
+    setVector(vector, values);
+    FieldVector encodedVector = (FieldVector) DictionaryEncoder.encode(vector, dictionary4);
+
+    List<Field> fields = Arrays.asList(encodedVector.getField());
+    List<FieldVector> vectors = Collections2.asImmutableList(encodedVector);
+    try (VectorSchemaRoot root = new VectorSchemaRoot(fields, vectors, encodedVector.getValueCount());
+         ByteArrayOutputStream out = new ByteArrayOutputStream();
+         ArrowFileWriter writer = new ArrowFileWriter(root, provider, newChannel(out));) {
+
+      writer.start();
+      writer.writeBatch();
+      writer.end();
+
+      try (SeekableReadChannel channel = new SeekableReadChannel(
+          new ByteArrayReadableSeekableByteChannel(out.toByteArray()));
+          ArrowFileReader reader = new ArrowFileReader(channel, allocator)) {
+        final VectorSchemaRoot readRoot = reader.getVectorSchemaRoot();
+        final Schema readSchema = readRoot.getSchema();
+        assertEquals(root.getSchema(), readSchema);
+        assertEquals(1, reader.getDictionaryBlocks().size());
+        assertEquals(1, reader.getRecordBlocks().size());
+
+        reader.loadNextBatch();
+        assertEquals(1, readRoot.getFieldVectors().size());
+        assertEquals(1, reader.getDictionaryVectors().size());
+        
+        // Read the encoded vector and check it
+        final FieldVector readEncoded = readRoot.getVector(0);
+        assertEquals(encodedVector.getValueCount(), readEncoded.getValueCount());
+        assertTrue(new RangeEqualsVisitor(encodedVector, readEncoded)
+            .rangeEquals(new Range(0, 0, encodedVector.getValueCount())));
+
+        // Read the dictionary
+        final Map<Long, Dictionary> readDictionaryMap = reader.getDictionaryVectors();
+        final Dictionary readDictionary =
+            readDictionaryMap.get(readEncoded.getField().getDictionary().getId());
+        assertNotNull(readDictionary);
+
+        // Assert the dictionary vector is correct
+        final FieldVector readDictionaryVector = readDictionary.getVector();
+        assertEquals(dictionaryVector4.getValueCount(), readDictionaryVector.getValueCount());
+        final BiFunction<ValueVector, ValueVector, Boolean> typeComparatorIgnoreName =
+            (v1, v2) -> new TypeEqualsVisitor(v1, false, true).equals(v2);
+        assertTrue("Dictionary vectors are not equal",
+            new RangeEqualsVisitor(dictionaryVector4, readDictionaryVector,
+                typeComparatorIgnoreName)
+                    .rangeEquals(new Range(0, 0, dictionaryVector4.getValueCount())));
+
+        // Assert the decoded vector is correct
+        final ValueVector readVector = DictionaryEncoder.decode(readEncoded, readDictionary);
+        assertEquals(vector.getValueCount(), readVector.getValueCount());
+        assertTrue("Decoded vectors are not equal",
+            new RangeEqualsVisitor(vector, readVector, typeComparatorIgnoreName)
+                .rangeEquals(new Range(0, 0, vector.getValueCount())));
+
+        vector.close();
+        readVector.close();

Review comment:
       Yes, I changed it. I did not do it before because I did not want the deep nesting.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] HedgehogCode commented on a change in pull request #8363: ARROW-10174: [Java] Fix reading/writing dict structs

Posted by GitBox <gi...@apache.org>.
HedgehogCode commented on a change in pull request #8363:
URL: https://github.com/apache/arrow/pull/8363#discussion_r503266354



##########
File path: java/vector/src/test/java/org/apache/arrow/vector/testing/ValueVectorDataPopulator.java
##########
@@ -673,4 +677,32 @@ public static void setVector(FixedSizeListVector vector, List<Integer>... values
     dataVector.setValueCount(curPos);
     vector.setValueCount(values.length);
   }
+
+  /**
+   * Populate values for {@link StructVector}.
+   */
+  public static void setVector(StructVector vector, Map<String, List<Integer>> values) {
+    vector.allocateNewSafe();
+
+    int valueCount = 0;
+    for (final Entry<String, List<Integer>> entry : values.entrySet()) {
+      // Add the child
+      final IntVector child = vector.addOrGet(entry.getKey(),
+          FieldType.nullable(MinorType.INT.getType()), IntVector.class);
+
+      // Write the values to the child
+      child.allocateNew();
+      final List<Integer> v = entry.getValue();
+      for (int i = 0; i < v.size(); i++) {
+        if (v.get(i) != null) {
+          child.set(i, v.get(i));
+        } else {
+          child.setNull(i);
+        }
+        vector.setIndexDefined(i);

Review comment:
       I see. I changed it.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] github-actions[bot] commented on pull request #8363: ARROW-10174: [Java] Fix reading/writing dict structs

Posted by GitBox <gi...@apache.org>.
github-actions[bot] commented on pull request #8363:
URL: https://github.com/apache/arrow/pull/8363#issuecomment-704359506


   https://issues.apache.org/jira/browse/ARROW-10174


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] liyafan82 commented on pull request #8363: ARROW-10174: [Java] Fix reading/writing dict structs

Posted by GitBox <gi...@apache.org>.
liyafan82 commented on pull request #8363:
URL: https://github.com/apache/arrow/pull/8363#issuecomment-705928888


   > @liyafan82 do you have time to review?
   
   @emkornfield Sure. I will take a look in one or two days. 


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] liyafan82 closed pull request #8363: ARROW-10174: [Java] Fix reading/writing dict structs

Posted by GitBox <gi...@apache.org>.
liyafan82 closed pull request #8363:
URL: https://github.com/apache/arrow/pull/8363


   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] liyafan82 commented on a change in pull request #8363: ARROW-10174: [Java] Fix reading/writing dict structs

Posted by GitBox <gi...@apache.org>.
liyafan82 commented on a change in pull request #8363:
URL: https://github.com/apache/arrow/pull/8363#discussion_r502371916



##########
File path: java/vector/src/test/java/org/apache/arrow/vector/ipc/TestArrowReaderWriter.java
##########
@@ -305,6 +321,64 @@ public void testWriteReadWithDictionaries() throws IOException {
     }
   }
 
+  @Test
+  public void testWriteReadWithStructDictionaries() throws IOException {
+    DictionaryProvider.MapDictionaryProvider provider = new DictionaryProvider.MapDictionaryProvider();
+    provider.put(dictionary4);
+
+    final StructVector vector = newVector(StructVector.class, "D4", MinorType.STRUCT, allocator);
+    final Map<String, List<Integer>> values = new HashMap<>();
+    // Index:                     0, 2, 1, 2, 1, 0, 0
+    values.put("a", Arrays.asList(1, 3, 2, 3, 2, 1, 1));
+    values.put("b", Arrays.asList(4, 6, 5, 6, 5, 4, 4));
+    setVector(vector, values);
+    FieldVector encodedVector = (FieldVector) DictionaryEncoder.encode(vector, dictionary4);
+
+    List<Field> fields = Arrays.asList(encodedVector.getField());
+    List<FieldVector> vectors = Collections2.asImmutableList(encodedVector);
+    try (VectorSchemaRoot root = new VectorSchemaRoot(fields, vectors, encodedVector.getValueCount());
+         ByteArrayOutputStream out = new ByteArrayOutputStream();
+         ArrowFileWriter writer = new ArrowFileWriter(root, provider, newChannel(out));) {
+
+      writer.start();
+      writer.writeBatch();
+      writer.end();
+
+      try (SeekableReadChannel channel = new SeekableReadChannel(
+          new ByteArrayReadableSeekableByteChannel(out.toByteArray()));
+          ArrowFileReader reader = new ArrowFileReader(channel, allocator)) {
+        final VectorSchemaRoot readRoot = reader.getVectorSchemaRoot();
+        final Schema readSchema = readRoot.getSchema();
+        assertEquals(root.getSchema(), readSchema);
+        assertEquals(1, reader.getDictionaryBlocks().size());
+        assertEquals(1, reader.getRecordBlocks().size());
+
+        reader.loadNextBatch();
+        assertEquals(1, readRoot.getFieldVectors().size());
+        assertEquals(1, reader.getDictionaryVectors().size());
+        
+        final FieldVector readEncoded = readRoot.getVector(0);
+        assertTrue(readEncoded instanceof IntVector);

Review comment:
       should we validate `readEncoded`?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] HedgehogCode commented on pull request #8363: ARROW-10174: [Java] Fix reading/writing dict structs

Posted by GitBox <gi...@apache.org>.
HedgehogCode commented on pull request #8363:
URL: https://github.com/apache/arrow/pull/8363#issuecomment-706193122


   I implemented the changes and force pushed them.
   
   Thank you for the advice with the `RangeEqualsVisitor`. Using it I discovered another bug in `DictionaryUtility`. The field for the Dictionary was always created with `nullable=false` but this is incorrect because dictionaries can be nullable. Therefore, I replaced this with the nullability read from the schema.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] liyafan82 commented on a change in pull request #8363: ARROW-10174: [Java] Fix reading/writing dict structs

Posted by GitBox <gi...@apache.org>.
liyafan82 commented on a change in pull request #8363:
URL: https://github.com/apache/arrow/pull/8363#discussion_r502373348



##########
File path: java/vector/src/test/java/org/apache/arrow/vector/testing/ValueVectorDataPopulator.java
##########
@@ -673,4 +677,28 @@ public static void setVector(FixedSizeListVector vector, List<Integer>... values
     dataVector.setValueCount(curPos);
     vector.setValueCount(values.length);
   }
+
+  /**
+   * Populate values for {@link StructVector}.
+   */
+  public static void setVector(StructVector vector, Map<String, List<Integer>> values) {
+    vector.allocateNewSafe();
+
+    int valueCount = 0;
+    for (final Entry<String, List<Integer>> entry : values.entrySet()) {
+      // Add the child
+      final IntVector child = vector.addOrGet(entry.getKey(),
+          FieldType.nullable(MinorType.INT.getType()), IntVector.class);
+
+      // Write the values to the child
+      child.allocateNew();
+      final List<Integer> v = entry.getValue();
+      for (int i = 0; i < v.size(); i++) {
+        child.set(i, v.get(i));

Review comment:
       We need to consider the null values in v?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] liyafan82 commented on a change in pull request #8363: ARROW-10174: [Java] Fix reading/writing dict structs

Posted by GitBox <gi...@apache.org>.
liyafan82 commented on a change in pull request #8363:
URL: https://github.com/apache/arrow/pull/8363#discussion_r502372509



##########
File path: java/vector/src/test/java/org/apache/arrow/vector/ipc/TestArrowReaderWriter.java
##########
@@ -305,6 +321,64 @@ public void testWriteReadWithDictionaries() throws IOException {
     }
   }
 
+  @Test
+  public void testWriteReadWithStructDictionaries() throws IOException {
+    DictionaryProvider.MapDictionaryProvider provider = new DictionaryProvider.MapDictionaryProvider();
+    provider.put(dictionary4);
+
+    final StructVector vector = newVector(StructVector.class, "D4", MinorType.STRUCT, allocator);
+    final Map<String, List<Integer>> values = new HashMap<>();
+    // Index:                     0, 2, 1, 2, 1, 0, 0
+    values.put("a", Arrays.asList(1, 3, 2, 3, 2, 1, 1));
+    values.put("b", Arrays.asList(4, 6, 5, 6, 5, 4, 4));
+    setVector(vector, values);
+    FieldVector encodedVector = (FieldVector) DictionaryEncoder.encode(vector, dictionary4);
+
+    List<Field> fields = Arrays.asList(encodedVector.getField());
+    List<FieldVector> vectors = Collections2.asImmutableList(encodedVector);
+    try (VectorSchemaRoot root = new VectorSchemaRoot(fields, vectors, encodedVector.getValueCount());
+         ByteArrayOutputStream out = new ByteArrayOutputStream();
+         ArrowFileWriter writer = new ArrowFileWriter(root, provider, newChannel(out));) {
+
+      writer.start();
+      writer.writeBatch();
+      writer.end();
+
+      try (SeekableReadChannel channel = new SeekableReadChannel(
+          new ByteArrayReadableSeekableByteChannel(out.toByteArray()));
+          ArrowFileReader reader = new ArrowFileReader(channel, allocator)) {
+        final VectorSchemaRoot readRoot = reader.getVectorSchemaRoot();
+        final Schema readSchema = readRoot.getSchema();
+        assertEquals(root.getSchema(), readSchema);
+        assertEquals(1, reader.getDictionaryBlocks().size());
+        assertEquals(1, reader.getRecordBlocks().size());
+
+        reader.loadNextBatch();
+        assertEquals(1, readRoot.getFieldVectors().size());
+        assertEquals(1, reader.getDictionaryVectors().size());
+        
+        final FieldVector readEncoded = readRoot.getVector(0);
+        assertTrue(readEncoded instanceof IntVector);
+        assertEquals(new FieldType(true, MinorType.INT.getType(), dictionary4.getEncoding()),
+            readEncoded.getField().getFieldType());
+
+        final Map<Long, Dictionary> readDictionaryMap = reader.getDictionaryVectors();
+        final Dictionary readDictionary =
+            readDictionaryMap.get(readEncoded.getField().getDictionary().getId());
+        assertNotNull(readDictionary);

Review comment:
       should we validate the read dictionary vector?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] liyafan82 commented on a change in pull request #8363: ARROW-10174: [Java] Fix reading/writing dict structs

Posted by GitBox <gi...@apache.org>.
liyafan82 commented on a change in pull request #8363:
URL: https://github.com/apache/arrow/pull/8363#discussion_r503178275



##########
File path: java/vector/src/test/java/org/apache/arrow/vector/testing/ValueVectorDataPopulator.java
##########
@@ -673,4 +677,32 @@ public static void setVector(FixedSizeListVector vector, List<Integer>... values
     dataVector.setValueCount(curPos);
     vector.setValueCount(values.length);
   }
+
+  /**
+   * Populate values for {@link StructVector}.
+   */
+  public static void setVector(StructVector vector, Map<String, List<Integer>> values) {
+    vector.allocateNewSafe();
+
+    int valueCount = 0;
+    for (final Entry<String, List<Integer>> entry : values.entrySet()) {
+      // Add the child
+      final IntVector child = vector.addOrGet(entry.getKey(),
+          FieldType.nullable(MinorType.INT.getType()), IntVector.class);
+
+      // Write the values to the child
+      child.allocateNew();
+      final List<Integer> v = entry.getValue();
+      for (int i = 0; i < v.size(); i++) {
+        if (v.get(i) != null) {
+          child.set(i, v.get(i));
+        } else {
+          child.setNull(i);
+        }
+        vector.setIndexDefined(i);

Review comment:
       For a StructVector, we can set an element to null. For its super class (NonNullableStructVector), we cannot set an element to null. 
   
   Here, I generally prefer setting an element to null if all sub-elements are null, because ValueVectorDataPopulator is a generally-purpose class, and we may use it to test scenarios with null elements. 




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] liyafan82 commented on a change in pull request #8363: ARROW-10174: [Java] Fix reading/writing dict structs

Posted by GitBox <gi...@apache.org>.
liyafan82 commented on a change in pull request #8363:
URL: https://github.com/apache/arrow/pull/8363#discussion_r502373097



##########
File path: java/vector/src/test/java/org/apache/arrow/vector/ipc/TestArrowReaderWriter.java
##########
@@ -305,6 +321,64 @@ public void testWriteReadWithDictionaries() throws IOException {
     }
   }
 
+  @Test
+  public void testWriteReadWithStructDictionaries() throws IOException {
+    DictionaryProvider.MapDictionaryProvider provider = new DictionaryProvider.MapDictionaryProvider();
+    provider.put(dictionary4);
+
+    final StructVector vector = newVector(StructVector.class, "D4", MinorType.STRUCT, allocator);
+    final Map<String, List<Integer>> values = new HashMap<>();
+    // Index:                     0, 2, 1, 2, 1, 0, 0
+    values.put("a", Arrays.asList(1, 3, 2, 3, 2, 1, 1));
+    values.put("b", Arrays.asList(4, 6, 5, 6, 5, 4, 4));
+    setVector(vector, values);
+    FieldVector encodedVector = (FieldVector) DictionaryEncoder.encode(vector, dictionary4);
+
+    List<Field> fields = Arrays.asList(encodedVector.getField());
+    List<FieldVector> vectors = Collections2.asImmutableList(encodedVector);
+    try (VectorSchemaRoot root = new VectorSchemaRoot(fields, vectors, encodedVector.getValueCount());
+         ByteArrayOutputStream out = new ByteArrayOutputStream();
+         ArrowFileWriter writer = new ArrowFileWriter(root, provider, newChannel(out));) {
+
+      writer.start();
+      writer.writeBatch();
+      writer.end();
+
+      try (SeekableReadChannel channel = new SeekableReadChannel(
+          new ByteArrayReadableSeekableByteChannel(out.toByteArray()));
+          ArrowFileReader reader = new ArrowFileReader(channel, allocator)) {
+        final VectorSchemaRoot readRoot = reader.getVectorSchemaRoot();
+        final Schema readSchema = readRoot.getSchema();
+        assertEquals(root.getSchema(), readSchema);
+        assertEquals(1, reader.getDictionaryBlocks().size());
+        assertEquals(1, reader.getRecordBlocks().size());
+
+        reader.loadNextBatch();
+        assertEquals(1, readRoot.getFieldVectors().size());
+        assertEquals(1, reader.getDictionaryVectors().size());
+        
+        final FieldVector readEncoded = readRoot.getVector(0);
+        assertTrue(readEncoded instanceof IntVector);
+        assertEquals(new FieldType(true, MinorType.INT.getType(), dictionary4.getEncoding()),
+            readEncoded.getField().getFieldType());
+
+        final Map<Long, Dictionary> readDictionaryMap = reader.getDictionaryVectors();
+        final Dictionary readDictionary =
+            readDictionaryMap.get(readEncoded.getField().getDictionary().getId());
+        assertNotNull(readDictionary);
+
+        final ValueVector readVector = DictionaryEncoder.decode(readEncoded, readDictionary);
+        assertTrue(readVector instanceof StructVector);
+        assertEquals(vector.getValueCount(), readVector.getValueCount());
+        for (int i = 0; i < vector.getValueCount(); i++) {
+          assertEquals(vector.getObject(i), readVector.getObject(i));
+        }

Review comment:
       Maybe we can compare the two vectors through a `RangeEqualsVisitor` object?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] HedgehogCode commented on a change in pull request #8363: ARROW-10174: [Java] Fix reading/writing dict structs

Posted by GitBox <gi...@apache.org>.
HedgehogCode commented on a change in pull request #8363:
URL: https://github.com/apache/arrow/pull/8363#discussion_r502432327



##########
File path: java/vector/src/main/java/org/apache/arrow/vector/util/DictionaryUtility.java
##########
@@ -115,25 +118,28 @@ public static Field toMemoryFormat(Field field, BufferAllocator allocator, Map<L
     }
 
     ArrowType type;
+    List<Field> fieldChildren;
     if (encoding == null) {
       type = field.getType();
+      fieldChildren = updatedChildren;
     } else {
       // re-type the field for in-memory format
       type = encoding.getIndexType();
+      fieldChildren = null;

Review comment:
       Correct. I moved the `= null` to the declaration of `fieldChildren`. (Usually I would make `fieldChildren` final and then this would be necessary. However, this seems not to be in line with the general code style in the repository).

##########
File path: java/vector/src/test/java/org/apache/arrow/vector/ipc/TestArrowReaderWriter.java
##########
@@ -305,6 +321,64 @@ public void testWriteReadWithDictionaries() throws IOException {
     }
   }
 
+  @Test
+  public void testWriteReadWithStructDictionaries() throws IOException {
+    DictionaryProvider.MapDictionaryProvider provider = new DictionaryProvider.MapDictionaryProvider();
+    provider.put(dictionary4);
+
+    final StructVector vector = newVector(StructVector.class, "D4", MinorType.STRUCT, allocator);
+    final Map<String, List<Integer>> values = new HashMap<>();
+    // Index:                     0, 2, 1, 2, 1, 0, 0
+    values.put("a", Arrays.asList(1, 3, 2, 3, 2, 1, 1));
+    values.put("b", Arrays.asList(4, 6, 5, 6, 5, 4, 4));
+    setVector(vector, values);
+    FieldVector encodedVector = (FieldVector) DictionaryEncoder.encode(vector, dictionary4);

Review comment:
       No. Encoding of struct vectors is tested in `TestDictionaryVector#testEncodeStruct`. The encoded vector should be fine.

##########
File path: java/vector/src/test/java/org/apache/arrow/vector/ipc/TestArrowReaderWriter.java
##########
@@ -305,6 +321,64 @@ public void testWriteReadWithDictionaries() throws IOException {
     }
   }
 
+  @Test
+  public void testWriteReadWithStructDictionaries() throws IOException {
+    DictionaryProvider.MapDictionaryProvider provider = new DictionaryProvider.MapDictionaryProvider();
+    provider.put(dictionary4);
+
+    final StructVector vector = newVector(StructVector.class, "D4", MinorType.STRUCT, allocator);
+    final Map<String, List<Integer>> values = new HashMap<>();
+    // Index:                     0, 2, 1, 2, 1, 0, 0
+    values.put("a", Arrays.asList(1, 3, 2, 3, 2, 1, 1));
+    values.put("b", Arrays.asList(4, 6, 5, 6, 5, 4, 4));
+    setVector(vector, values);
+    FieldVector encodedVector = (FieldVector) DictionaryEncoder.encode(vector, dictionary4);
+
+    List<Field> fields = Arrays.asList(encodedVector.getField());
+    List<FieldVector> vectors = Collections2.asImmutableList(encodedVector);
+    try (VectorSchemaRoot root = new VectorSchemaRoot(fields, vectors, encodedVector.getValueCount());
+         ByteArrayOutputStream out = new ByteArrayOutputStream();
+         ArrowFileWriter writer = new ArrowFileWriter(root, provider, newChannel(out));) {
+
+      writer.start();
+      writer.writeBatch();
+      writer.end();
+
+      try (SeekableReadChannel channel = new SeekableReadChannel(
+          new ByteArrayReadableSeekableByteChannel(out.toByteArray()));
+          ArrowFileReader reader = new ArrowFileReader(channel, allocator)) {
+        final VectorSchemaRoot readRoot = reader.getVectorSchemaRoot();
+        final Schema readSchema = readRoot.getSchema();
+        assertEquals(root.getSchema(), readSchema);
+        assertEquals(1, reader.getDictionaryBlocks().size());
+        assertEquals(1, reader.getRecordBlocks().size());
+
+        reader.loadNextBatch();
+        assertEquals(1, readRoot.getFieldVectors().size());
+        assertEquals(1, reader.getDictionaryVectors().size());
+        
+        final FieldVector readEncoded = readRoot.getVector(0);
+        assertTrue(readEncoded instanceof IntVector);

Review comment:
       Yes. It is interesting to check the `Field`. However, using the `RangeEqualsVisitor` we can check both without any more code. So I did it.

##########
File path: java/vector/src/test/java/org/apache/arrow/vector/ipc/TestArrowReaderWriter.java
##########
@@ -305,6 +321,64 @@ public void testWriteReadWithDictionaries() throws IOException {
     }
   }
 
+  @Test
+  public void testWriteReadWithStructDictionaries() throws IOException {
+    DictionaryProvider.MapDictionaryProvider provider = new DictionaryProvider.MapDictionaryProvider();
+    provider.put(dictionary4);
+
+    final StructVector vector = newVector(StructVector.class, "D4", MinorType.STRUCT, allocator);
+    final Map<String, List<Integer>> values = new HashMap<>();
+    // Index:                     0, 2, 1, 2, 1, 0, 0
+    values.put("a", Arrays.asList(1, 3, 2, 3, 2, 1, 1));
+    values.put("b", Arrays.asList(4, 6, 5, 6, 5, 4, 4));
+    setVector(vector, values);
+    FieldVector encodedVector = (FieldVector) DictionaryEncoder.encode(vector, dictionary4);
+
+    List<Field> fields = Arrays.asList(encodedVector.getField());
+    List<FieldVector> vectors = Collections2.asImmutableList(encodedVector);
+    try (VectorSchemaRoot root = new VectorSchemaRoot(fields, vectors, encodedVector.getValueCount());
+         ByteArrayOutputStream out = new ByteArrayOutputStream();
+         ArrowFileWriter writer = new ArrowFileWriter(root, provider, newChannel(out));) {
+
+      writer.start();
+      writer.writeBatch();
+      writer.end();
+
+      try (SeekableReadChannel channel = new SeekableReadChannel(
+          new ByteArrayReadableSeekableByteChannel(out.toByteArray()));
+          ArrowFileReader reader = new ArrowFileReader(channel, allocator)) {
+        final VectorSchemaRoot readRoot = reader.getVectorSchemaRoot();
+        final Schema readSchema = readRoot.getSchema();
+        assertEquals(root.getSchema(), readSchema);
+        assertEquals(1, reader.getDictionaryBlocks().size());
+        assertEquals(1, reader.getRecordBlocks().size());
+
+        reader.loadNextBatch();
+        assertEquals(1, readRoot.getFieldVectors().size());
+        assertEquals(1, reader.getDictionaryVectors().size());
+        
+        final FieldVector readEncoded = readRoot.getVector(0);
+        assertTrue(readEncoded instanceof IntVector);
+        assertEquals(new FieldType(true, MinorType.INT.getType(), dictionary4.getEncoding()),
+            readEncoded.getField().getFieldType());
+
+        final Map<Long, Dictionary> readDictionaryMap = reader.getDictionaryVectors();
+        final Dictionary readDictionary =
+            readDictionaryMap.get(readEncoded.getField().getDictionary().getId());
+        assertNotNull(readDictionary);

Review comment:
       Yes. I added a check.

##########
File path: java/vector/src/test/java/org/apache/arrow/vector/ipc/TestArrowReaderWriter.java
##########
@@ -305,6 +321,64 @@ public void testWriteReadWithDictionaries() throws IOException {
     }
   }
 
+  @Test
+  public void testWriteReadWithStructDictionaries() throws IOException {
+    DictionaryProvider.MapDictionaryProvider provider = new DictionaryProvider.MapDictionaryProvider();
+    provider.put(dictionary4);
+
+    final StructVector vector = newVector(StructVector.class, "D4", MinorType.STRUCT, allocator);
+    final Map<String, List<Integer>> values = new HashMap<>();
+    // Index:                     0, 2, 1, 2, 1, 0, 0
+    values.put("a", Arrays.asList(1, 3, 2, 3, 2, 1, 1));
+    values.put("b", Arrays.asList(4, 6, 5, 6, 5, 4, 4));
+    setVector(vector, values);
+    FieldVector encodedVector = (FieldVector) DictionaryEncoder.encode(vector, dictionary4);
+
+    List<Field> fields = Arrays.asList(encodedVector.getField());
+    List<FieldVector> vectors = Collections2.asImmutableList(encodedVector);
+    try (VectorSchemaRoot root = new VectorSchemaRoot(fields, vectors, encodedVector.getValueCount());
+         ByteArrayOutputStream out = new ByteArrayOutputStream();
+         ArrowFileWriter writer = new ArrowFileWriter(root, provider, newChannel(out));) {
+
+      writer.start();
+      writer.writeBatch();
+      writer.end();
+
+      try (SeekableReadChannel channel = new SeekableReadChannel(
+          new ByteArrayReadableSeekableByteChannel(out.toByteArray()));
+          ArrowFileReader reader = new ArrowFileReader(channel, allocator)) {
+        final VectorSchemaRoot readRoot = reader.getVectorSchemaRoot();
+        final Schema readSchema = readRoot.getSchema();
+        assertEquals(root.getSchema(), readSchema);
+        assertEquals(1, reader.getDictionaryBlocks().size());
+        assertEquals(1, reader.getRecordBlocks().size());
+
+        reader.loadNextBatch();
+        assertEquals(1, readRoot.getFieldVectors().size());
+        assertEquals(1, reader.getDictionaryVectors().size());
+        
+        final FieldVector readEncoded = readRoot.getVector(0);
+        assertTrue(readEncoded instanceof IntVector);
+        assertEquals(new FieldType(true, MinorType.INT.getType(), dictionary4.getEncoding()),
+            readEncoded.getField().getFieldType());
+
+        final Map<Long, Dictionary> readDictionaryMap = reader.getDictionaryVectors();
+        final Dictionary readDictionary =
+            readDictionaryMap.get(readEncoded.getField().getDictionary().getId());
+        assertNotNull(readDictionary);
+
+        final ValueVector readVector = DictionaryEncoder.decode(readEncoded, readDictionary);
+        assertTrue(readVector instanceof StructVector);
+        assertEquals(vector.getValueCount(), readVector.getValueCount());
+        for (int i = 0; i < vector.getValueCount(); i++) {
+          assertEquals(vector.getObject(i), readVector.getObject(i));
+        }

Review comment:
       Good advice! The `RangeEqualsVisitor` does some checks on the `Field` that were missing before.

##########
File path: java/vector/src/test/java/org/apache/arrow/vector/testing/ValueVectorDataPopulator.java
##########
@@ -673,4 +677,28 @@ public static void setVector(FixedSizeListVector vector, List<Integer>... values
     dataVector.setValueCount(curPos);
     vector.setValueCount(values.length);
   }
+
+  /**
+   * Populate values for {@link StructVector}.
+   */
+  public static void setVector(StructVector vector, Map<String, List<Integer>> values) {
+    vector.allocateNewSafe();
+
+    int valueCount = 0;
+    for (final Entry<String, List<Integer>> entry : values.entrySet()) {
+      // Add the child
+      final IntVector child = vector.addOrGet(entry.getKey(),
+          FieldType.nullable(MinorType.INT.getType()), IntVector.class);
+
+      // Write the values to the child
+      child.allocateNew();
+      final List<Integer> v = entry.getValue();
+      for (int i = 0; i < v.size(); i++) {
+        child.set(i, v.get(i));

Review comment:
       Yes, why not. I added an `if` switch for them.

##########
File path: java/vector/src/test/java/org/apache/arrow/vector/testing/ValueVectorDataPopulator.java
##########
@@ -673,4 +677,28 @@ public static void setVector(FixedSizeListVector vector, List<Integer>... values
     dataVector.setValueCount(curPos);
     vector.setValueCount(values.length);
   }
+
+  /**
+   * Populate values for {@link StructVector}.
+   */
+  public static void setVector(StructVector vector, Map<String, List<Integer>> values) {
+    vector.allocateNewSafe();
+
+    int valueCount = 0;
+    for (final Entry<String, List<Integer>> entry : values.entrySet()) {
+      // Add the child
+      final IntVector child = vector.addOrGet(entry.getKey(),
+          FieldType.nullable(MinorType.INT.getType()), IntVector.class);
+
+      // Write the values to the child
+      child.allocateNew();
+      final List<Integer> v = entry.getValue();
+      for (int i = 0; i < v.size(); i++) {
+        child.set(i, v.get(i));
+        vector.setIndexDefined(i);
+      }

Review comment:
       No. The struct vector sets the value count of its children in `#setValueCount`.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org