Posted to jira@arrow.apache.org by "Krisztian Szucs (Jira)" <ji...@apache.org> on 2021/01/16 21:39:00 UTC

[jira] [Updated] (ARROW-10174) [Java] Reading of Dictionary encoded struct vector fails

     [ https://issues.apache.org/jira/browse/ARROW-10174?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Krisztian Szucs updated ARROW-10174:
------------------------------------
    Fix Version/s:     (was: 2.0.0)
                   3.0.0

> [Java] Reading of Dictionary encoded struct vector fails 
> ---------------------------------------------------------
>
>                 Key: ARROW-10174
>                 URL: https://issues.apache.org/jira/browse/ARROW-10174
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Java
>    Affects Versions: 1.0.1
>            Reporter: Benjamin Wilhelm
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 3.0.0
>
>          Time Spent: 4h
>  Remaining Estimate: 0h
>
> Write an index vector together with a dictionary whose dictionary vector is of type {{Struct}} using an {{ArrowStreamWriter}}. Reading the stream back fails with an exception.
> Code to reproduce:
> {code:java}
> // Imports needed by the statements below (the statements themselves belong inside a test method)
> import java.io.ByteArrayInputStream;
> import java.io.ByteArrayOutputStream;
> import java.nio.channels.Channels;
> import java.util.Collections;
> import java.util.Map;
> import java.util.Random;
> import org.apache.arrow.memory.RootAllocator;
> import org.apache.arrow.vector.FieldVector;
> import org.apache.arrow.vector.IntVector;
> import org.apache.arrow.vector.ValueVector;
> import org.apache.arrow.vector.VectorSchemaRoot;
> import org.apache.arrow.vector.complex.StructVector;
> import org.apache.arrow.vector.complex.impl.NullableStructWriter;
> import org.apache.arrow.vector.complex.writer.IntWriter;
> import org.apache.arrow.vector.dictionary.Dictionary;
> import org.apache.arrow.vector.dictionary.DictionaryEncoder;
> import org.apache.arrow.vector.dictionary.DictionaryProvider.MapDictionaryProvider;
> import org.apache.arrow.vector.ipc.ArrowStreamReader;
> import org.apache.arrow.vector.ipc.ArrowStreamWriter;
> import org.apache.arrow.vector.types.pojo.DictionaryEncoding;
>
> final RootAllocator allocator = new RootAllocator();
> // Create the dictionary
> final StructVector dict = StructVector.empty("Dict", allocator);
> final NullableStructWriter dictWriter = dict.getWriter();
> final IntWriter dictA = dictWriter.integer("a");
> final IntWriter dictB = dictWriter.integer("b");
> for (int i = 0; i < 3; i++) {
> 	dictWriter.start();
> 	dictA.writeInt(i);
> 	dictB.writeInt(i);
> 	dictWriter.end();
> }
> dict.setValueCount(3);
> final Dictionary dictionary = new Dictionary(dict, new DictionaryEncoding(1, false, null));
> // Create the vector
> final Random random = new Random();
> final StructVector vector = StructVector.empty("Dict", allocator);
> final NullableStructWriter vectorWriter = vector.getWriter();
> final IntWriter vectorA = vectorWriter.integer("a");
> final IntWriter vectorB = vectorWriter.integer("b");
> for (int i = 0; i < 10; i++) {
> 	int v = random.nextInt(3);
> 	vectorWriter.start();
> 	vectorA.writeInt(v);
> 	vectorB.writeInt(v);
> 	vectorWriter.end();
> }
> vector.setValueCount(10);
> // Encode the vector using the dictionary
> final IntVector indexVector = (IntVector) DictionaryEncoder.encode(vector, dictionary);
> // Write the vector to out
> final ByteArrayOutputStream out = new ByteArrayOutputStream();
> final VectorSchemaRoot root = new VectorSchemaRoot(Collections.singletonList(indexVector.getField()),
> 		Collections.singletonList(indexVector));
> final ArrowStreamWriter writer = new ArrowStreamWriter(root, new MapDictionaryProvider(dictionary),
> 		Channels.newChannel(out));
> writer.start();
> writer.writeBatch();
> writer.end();
> // Read the vector from out
> try (final ArrowStreamReader reader = new ArrowStreamReader(new ByteArrayInputStream(out.toByteArray()),
> 		allocator)) {
> 	reader.loadNextBatch();
> 	final VectorSchemaRoot readRoot = reader.getVectorSchemaRoot();
> 	final FieldVector readIndexVector = readRoot.getVector(0);
> 	// Get the dictionary and decode
> 	final Map<Long, Dictionary> readDictionaryMap = reader.getDictionaryVectors();
> 	final Dictionary readDictionary = readDictionaryMap.get(readIndexVector.getField().getDictionary().getId());
> 	final ValueVector readVector = DictionaryEncoder.decode(readIndexVector, readDictionary);
> }
> {code}
> Exception:
> {code}
> java.lang.IllegalArgumentException: not all nodes and buffers were consumed. nodes: [ArrowFieldNode [length=3, nullCount=0], ArrowFieldNode [length=3, nullCount=0]] buffers: [ArrowBuf[21], address:140118352739688, length:1, ArrowBuf[22], address:140118352739696, length:12, ArrowBuf[23], address:140118352739712, length:1, ArrowBuf[24], address:140118352739720, length:12]
> 	at org.apache.arrow.vector.VectorLoader.load(VectorLoader.java:63)
> 	at org.apache.arrow.vector.ipc.ArrowReader.load(ArrowReader.java:241)
> 	at org.apache.arrow.vector.ipc.ArrowReader.loadDictionary(ArrowReader.java:232)
> 	at org.apache.arrow.vector.ipc.ArrowStreamReader.loadNextBatch(ArrowStreamReader.java:129)
> 	at com.knime.AppTest.testDictionaryStruct(AppTest.java:83)
> {code}
> If I see it correctly, the error happens in {{DictionaryUtility#toMessageFormat}}. When a dictionary-encoded vector is encountered, the children of the memory-format field are still used (there are none, because the index field is an Int). Instead, the children of the dictionary vector's field should be mapped to the message format and set as children.
> I can create a fix and open a pull request.
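
The diagnosis in the description can be illustrated with a small toy model. This is only a sketch under stated assumptions: ToyField, nodesConsumed, toMessageFormatReported and toMessageFormatProposed are hypothetical names invented for this illustration and are not part of the Arrow API. The sketch assumes that a loader consumes one field node per schema field, depth first, which would be consistent with the two unconsumed nodes listed in the exception.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Toy model only: ToyField is a hypothetical stand-in, not Arrow's Field class.
final class DictionaryFieldSketch {

    static final class ToyField {
        final String name;
        final String type;
        final List<ToyField> children;

        ToyField(String name, String type, List<ToyField> children) {
            this.name = name;
            this.type = type;
            this.children = children;
        }
    }

    // Number of field nodes a depth-first loader would consume for this field:
    // one for the field itself plus one for every descendant.
    static int nodesConsumed(ToyField field) {
        int count = 1;
        for (ToyField child : field.children) {
            count += nodesConsumed(child);
        }
        return count;
    }

    // Behaviour as described in the report: the message-format field takes the
    // dictionary's type but keeps the children of the index field (none for Int).
    static ToyField toMessageFormatReported(ToyField indexField, ToyField dictField) {
        return new ToyField(indexField.name, dictField.type, indexField.children);
    }

    // Proposed behaviour: the children come from the dictionary vector's field.
    // (Assumption: children are copied as-is; a real fix would also have to map
    // each child to the message format.)
    static ToyField toMessageFormatProposed(ToyField indexField, ToyField dictField) {
        return new ToyField(indexField.name, dictField.type, dictField.children);
    }

    public static void main(String[] args) {
        ToyField a = new ToyField("a", "Int", Collections.<ToyField>emptyList());
        ToyField b = new ToyField("b", "Int", Collections.<ToyField>emptyList());
        ToyField dictField = new ToyField("Dict", "Struct", Arrays.asList(a, b));
        ToyField indexField = new ToyField("Dict", "Int", Collections.<ToyField>emptyList());

        // The dictionary batch carries one node per field of the dictionary vector.
        int nodesWritten = nodesConsumed(dictField);

        int leftoverReported =
            nodesWritten - nodesConsumed(toMessageFormatReported(indexField, dictField));
        int leftoverProposed =
            nodesWritten - nodesConsumed(toMessageFormatProposed(indexField, dictField));

        System.out.println("nodes written: " + nodesWritten);
        System.out.println("left over (reported behaviour): " + leftoverReported);
        System.out.println("left over (proposed behaviour): " + leftoverProposed);
    }
}
```

In this model the struct dictionary writes three nodes (the struct plus "a" and "b"). A message-format field without children accounts for only one of them, leaving two, while taking the children from the dictionary's field accounts for all three.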



--
This message was sent by Atlassian Jira
(v8.3.4#803005)