You are viewing a plain text version of this content. The canonical link for it is here.
Posted to jira@arrow.apache.org by "Arrow User (Jira)" <ji...@apache.org> on 2022/03/18 17:14:00 UTC

[jira] [Comment Edited] (ARROW-15971) [C++/Java] Error when reading inner lists within a struct in empty outer lists from C++/Python in Java

    [ https://issues.apache.org/jira/browse/ARROW-15971?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17508876#comment-17508876 ] 

Arrow User edited comment on ARROW-15971 at 3/18/22, 5:13 PM:
--------------------------------------------------------------

Performed some early investigation on the potential issue, if it helps:
In C++, the inner list consists of a single offset buffer [0], as no append is ever called on the inner list. Note that, in this case, the outer list is empty, and therefore struct does not append empty values to its children.

The memory layout for the list items is shown as:
{noformat}
array_1: list<item: struct<array_sub_col: list<item: string>>>         < -         offset: [0, 0] ; offset capacity: 8;  value size/capacity: 1;
  child 0, item: struct<array_sub_col: list<item: string>>             
      child 0, array_sub_col: list<item: string>                       < -         offset: [0]    ; offset capacity: 4;  value size/capacity: 0;
          child 0, item: string
{noformat}

C++ explicitly fails on zero length offsets: https://github.com/apache/arrow/blob/master/cpp/src/arrow/array/array_nested.cc#L111, which means that populating a single int32 offset buffer on lists that are never appended to before finishing the builder to an array is expected. C++ also relies on the offset length (offset length - 1) to denote the length of the array to construct during reading on various parts (e.g. ListArrayFromArrays).
A fix for Java on 7.0.0 was introduced: https://github.com/apache/arrow/commit/13b66b57b454d2b6c4ea35e3d19adbdd85b17810 to handle 0 offset capacity, but the issue remains with the latest versions.
Therefore, would it be expected that the Java code assigns a current and max offset positioning of 0, when encountered with offset builders of capacity **<=4 (or ==0 || ==4)**? See https://github.com/apache/arrow/blob/master/java/vector/src/main/java/org/apache/arrow/vector/complex/impl/UnionListReader.java#L59:

 {code:java}public void setPosition(int index) {
    super.setPosition(index);
-   if (vector.getOffsetBuffer().capacity() == 0) {
+   if (vector.getOffsetBuffer().capacity() <= OFFSET_WIDTH) {
      currentOffset = 0;
      maxOffset = 0;
    } else { ...
{code}


was (Author: JIRAUSER286794):
Performed some early investigation on the potential issue, if it helps:
In C++, the inner list consists of a single offset buffer [0], as no append is ever called on the inner list. Note that, in this case, the outer list is empty, and therefore struct does not append empty values to its children.

The memory layout for the list items is shown as:
{noformat}
array_1: list<item: struct<array_sub_col: list<item: string>>>         < -         offset: [0, 0] ; offset capacity: 8;  value size/capacity: 1;
  child 0, item: struct<array_sub_col: list<item: string>>             
      child 0, array_sub_col: list<item: string>                       < -         offset: [0]    ; offset capacity: 4;  value size/capacity: 0;
          child 0, item: string
{noformat}

C++ explicitly fails on zero length offsets: https://github.com/apache/arrow/blob/master/cpp/src/arrow/array/array_nested.cc#L111, which means that populating a single int32 offset buffer on lists that are never appended to before finishing the builder to an array is expected. C++ also relies on the offset length (offset length - 1) to denote the length of the array to construct during reading on various parts (e.g. ListArrayFromArrays).
A fix for Java on 7.0.0 was introduced: https://github.com/apache/arrow/commit/13b66b57b454d2b6c4ea35e3d19adbdd85b17810 to handle 0 offset capacity, but the issue remains with the latest versions.
Therefore, would it be expected that the Java code assigns a current and max offset positioning of 0, when encountered with offset builders of capacity `<=4` (or ==0 || ==4)? See https://github.com/apache/arrow/blob/master/java/vector/src/main/java/org/apache/arrow/vector/complex/impl/UnionListReader.java#L59:

 {code:java}public void setPosition(int index) {
    super.setPosition(index);
-   if (vector.getOffsetBuffer().capacity() == 0) {
+   if (vector.getOffsetBuffer().capacity() <= OFFSET_WIDTH) {
      currentOffset = 0;
      maxOffset = 0;
    } else { ...
{code}

> [C++/Java] Error when reading inner lists within a struct in empty outer lists from C++/Python in Java
> ------------------------------------------------------------------------------------------------------
>
>                 Key: ARROW-15971
>                 URL: https://issues.apache.org/jira/browse/ARROW-15971
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: C++, Java, Python
>    Affects Versions: 7.0.0
>            Reporter: Arrow User
>            Priority: Major
>              Labels: bug
>
> h2. Description
> When using C++ (or Python) to construct a null or empty outer array of type *array_1: list<item: struct<array_sub_col: list<item: string>>>*, either:
> {noformat}
> -  array_1: null
> -  array_1: []
> {noformat}
> an out of bounds exceptions (see stack trace below) follows when later retrieving the field reader for the inner list (*array_sub_col*) in Java, when trying to access the subsequent offset buffer: https://github.com/apache/arrow/blob/release-4.0.1/java/vector/src/main/java/org/apache/arrow/vector/complex/impl/UnionListReader.java#L60
> h2. Reproduction
> *Java*: 7.0.0
> *C++*: 4.0.0
> *Python*: 7.0.0
> Creating a stream on C++ of type *array_1: list<item: struct<array_sub_col: list<item: string>>>* with an empty (or null) outer list:
> {code:c++}arrow::MemoryPool* pool = arrow::default_memory_pool();
> arrow::Result<std::shared_ptr<arrow::io::BufferOutputStream>> stream_buffer =
>     arrow::io::BufferOutputStream::Create(1, pool);
> std::vector<std::shared_ptr<arrow::Field>> inner_list_field\{std::make_shared<arrow::Field>("array_sub_col",arrow::list(arrow::utf8()))};
> // Datatype for the builder: list<struct<list<string>>>
> std::shared_ptr<DataType> data_type = list(struct_(inner_list_field));
> std::unique_ptr<arrow::ArrayBuilder> builder;
> arrow::MakeBuilder(pool, data_type, &builder);
> auto* list_builder = dynamic_cast<arrow::ListBuilder*>(builder.get());
> // Append a null or an empty list to the outer list
> list_builder->AppendNull(); // or list_builder->AppendEmptyValue()
> std::vector<std::shared_ptr<arrow::Array>> value_batch;
> value_batch.resize(1);
> list_builder->Finish(&value_batch[0]);
> std::vector<std::shared_ptr<arrow::Field>> outer_list_field\{std::make_shared<arrow::Field>("array_1",data_type)};
> auto schema = std::make_shared<arrow::Schema>(outer_list_field);
> // Build a single row record batch
> std::shared_ptr<arrow::RecordBatch> batch = RecordBatch::Make(schema, 1, value_batch);
> ASSERT_OK(batch->Validate());
> // Stream the batch to a file to later read on the Java side
> arrow::Result<std::shared_ptr<ipc::RecordBatchWriter>> stream_writer = 
>     arrow::ipc::MakeStreamWriter(stream_buffer.ValueOrDie().get(), schema, arrow::ipc::IpcWriteOptions::Defaults());
> stream_writer.ValueOrDie()->WriteRecordBatch(*batch);
> arrow::Result<std::shared_ptr<arrow::Buffer>> buffer_result = stream_buffer.ValueOrDie()->Finish();
> std::shared_ptr<arrow::Buffer> buffer = buffer_result.ValueOrDie();
> auto file_output = arrow::io::FileOutputStream::Open("/tmp/batch_stream.out").ValueOrDie();
> file_output->Write(buffer->data(), buffer->size());
> file_output->Close();
> {code}
> As expected, Python holds the same memory layout for the field vectors as the code above:
> {code:python}array = pa.array([None], type=pa.list_(pa.struct([pa.field("array_sub_col", pa.list_(pa.utf8()))])))
> batch = pa.record_batch([struct_array], names=["array_1"])
> sink = pa.BufferOutputStream()
> with pa.ipc.new_stream(sink, batch.schema) as writer:
>     writer.write_batch(batch)
> buf = sink.getvalue()
> with open('/tmp/batch_stream.out', 'wb') as f:
>     f.write(buf)
> {code}
> *Java fails when then trying to access the inner list's field reader:*
> {code:java}File file = new File("/tmp/batch_stream.out");
> byte[] bytes = FileUtils.readFileToByteArray(file);
> try (ArrowStreamReader reader = new ArrowStreamReader(new ByteArrayInputStream(bytes), allocator)) {
>      Schema schema = reader.getVectorSchemaRoot().getSchema();
>      reader.loadNextBatch();
>      readBatch.getVector("array_1").getReader().reader().reader("array_sub_col");                     // <- fails: reader("array_sub_col") fails with OOB
>      // Concrete readers:
>      // FieldVector array_1 = readBatch.getVector("array_1");
>      // UnionListReader array_1_reader = (UnionListReader) array_1.getReader();
>      // NullableStructReaderImpl struct_reader = (NullableStructReaderImpl) array_1_reader.reader();
>      // FieldReader union_list_reader = struct_reader.reader("array_sub_col");                        // <- fails: OOB
> {code}
> h3. Stack trace:
> {noformat}
> java.lang.reflect.InvocationTargetException
>     at jdk.internal.reflect.NativeMethodAccessorImpl.invoke0 (Native Method)
>     at jdk.internal.reflect.NativeMethodAccessorImpl.invoke (NativeMethodAccessorImpl.java:62)
>     at jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke (DelegatingMethodAccessorImpl.java:43)
>     at java.lang.reflect.Method.invoke (Method.java:566)
>     at org.codehaus.mojo.exec.ExecJavaMojo$1.run (ExecJavaMojo.java:297)
>     at java.lang.Thread.run (Thread.java:829)
> Caused by: java.lang.IndexOutOfBoundsException: index: 4, length: 4 (expected: range(0, 4))
>     at org.apache.arrow.memory.ArrowBuf.checkIndexD (ArrowBuf.java:318)
>     at org.apache.arrow.memory.ArrowBuf.chk (ArrowBuf.java:305)
>     at org.apache.arrow.memory.ArrowBuf.getInt (ArrowBuf.java:424)
>     at com.test.arrow.ValidateArrow.testArrow (ValidateArrow.java:433)
>     at com.test.arrow.ValidateArrow.main (ValidateArrow.java:440)
>     at jdk.internal.reflect.NativeMethodAccessorImpl.invoke0 (Native Method)
>     at jdk.internal.reflect.NativeMethodAccessorImpl.invoke (NativeMethodAccessorImpl.java:62)
>     at jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke (DelegatingMethodAccessorImpl.java:43)
>     at java.lang.reflect.Method.invoke (Method.java:566)
>     at org.codehaus.mojo.exec.ExecJavaMojo$1.run (ExecJavaMojo.java:297)
>     at java.lang.Thread.run (Thread.java:829)
> {noformat}



--
This message was sent by Atlassian Jira
(v8.20.1#820001)