Posted to issues@arrow.apache.org by "Arthur Passos (Jira)" <ji...@apache.org> on 2022/11/10 20:38:00 UTC

[jira] [Created] (ARROW-18307) [C++] Read list/array data from ChunkedArray with multiple chunks

Arthur Passos created ARROW-18307:
-------------------------------------

             Summary: [C++] Read list/array data from ChunkedArray with multiple chunks
                 Key: ARROW-18307
                 URL: https://issues.apache.org/jira/browse/ARROW-18307
             Project: Apache Arrow
          Issue Type: Test
          Components: C++
            Reporter: Arthur Passos


I am reading a Parquet file with arrow::RecordBatchReader, and the arrow::Table returned contains columns with multiple chunks (column->num_chunks() > 1). The column in question is of type Array(Int64), although the problem is not limited to that type.

 

I want to convert this arrow column into an internal structure that contains a contiguous chunk of memory for the data and a vector of offsets, very similar to arrow's structure. The code I have so far works in two "phases":

1. Get the nested arrow column data. In this case, get the Int64 data out of Array(Int64).
2. Get offsets from Array(Int64).

To achieve #1, I loop over the chunks and store the result of arrow::ListArray::values for each chunk in a new arrow::ChunkedArray.



 
{code:java}
static std::shared_ptr<arrow::ChunkedArray> getNestedArrowColumn(std::shared_ptr<arrow::ChunkedArray> & arrow_column)
{
    arrow::ArrayVector array_vector;
    array_vector.reserve(arrow_column->num_chunks());
    for (size_t chunk_i = 0, num_chunks = static_cast<size_t>(arrow_column->num_chunks()); chunk_i < num_chunks; ++chunk_i)
    {
        arrow::ListArray & list_chunk = dynamic_cast<arrow::ListArray &>(*(arrow_column->chunk(chunk_i)));
        std::shared_ptr<arrow::Array> chunk = list_chunk.values();
        array_vector.emplace_back(std::move(chunk));
    }
    return std::make_shared<arrow::ChunkedArray>(array_vector);
}
{code}

This does not work as expected, though. Even though there are multiple chunks, the arrow::ListArray::values method returns the very same buffer for all of them, which ends up duplicating the data on my side.

I then looked through more examples and came across the [ColumnarTableToVector example|https://github.com/apache/arrow/blob/master/cpp/examples/arrow/row_wise_conversion_example.cc#L121]. It looks like this example assumes there is only one chunk and ignores the possibility of multiple chunks. That is probably just a detail, and the example wasn't actually intended to cover multiple chunks.

I managed to get the expected output with something like the following:
{code:java}
auto & list_chunk1 = dynamic_cast<::arrow::ListArray &>(*(arrow_column->chunk(0)));
auto & list_chunk2 = dynamic_cast<::arrow::ListArray &>(*(arrow_column->chunk(1)));

// First value offset of each chunk
auto l1_offset = *list_chunk1.raw_value_offsets();
auto l2_offset = *list_chunk2.raw_value_offsets();

// One-past-the-end value offset of each chunk
auto l1_end_offset = list_chunk1.value_offset(list_chunk1.length());
auto l2_end_offset = list_chunk2.value_offset(list_chunk2.length());

// Slice the values array down to the range each chunk actually refers to
auto lcv1 = list_chunk1.values()->SliceSafe(l1_offset, l1_end_offset - l1_offset).ValueOrDie();
auto lcv2 = list_chunk2.values()->SliceSafe(l2_offset, l2_end_offset - l2_offset).ValueOrDie();{code}
This looks too hackish and I feel like there is a much better way.
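
For illustration, the two-chunk workaround above can be generalized to a loop over any number of chunks. The following is only a rough sketch of that idea using plain std::vector stand-ins rather than Arrow types: ListChunkView, FlatColumn and flattenChunks are hypothetical names, not part of the Arrow API. It assumes each chunk carries absolute offsets into a values buffer that may be shared with other chunks.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for one list chunk: a window of offsets into a
// values buffer that may be shared with other chunks.
struct ListChunkView
{
    const std::vector<int64_t> * values;  // child values, possibly shared
    std::vector<int32_t> offsets;         // (number of lists + 1) absolute positions in *values
};

// Hypothetical target layout: contiguous data plus offsets that start at 0.
struct FlatColumn
{
    std::vector<int64_t> data;
    std::vector<int64_t> offsets{0};
};

FlatColumn flattenChunks(const std::vector<ListChunkView> & chunks)
{
    FlatColumn out;
    for (const auto & chunk : chunks)
    {
        assert(!chunk.offsets.empty());
        const int32_t begin = chunk.offsets.front();
        const int32_t end = chunk.offsets.back();

        // Copy only the window of values this chunk refers to, not the whole buffer.
        out.data.insert(out.data.end(), chunk.values->begin() + begin, chunk.values->begin() + end);

        // Rebase the chunk's offsets so they continue from the data written so far.
        const int64_t base = out.offsets.back() - begin;
        for (std::size_t i = 1; i < chunk.offsets.size(); ++i)
            out.offsets.push_back(base + chunk.offsets[i]);
    }
    return out;
}
```

With real Arrow chunks, the begin and end positions would correspond to the first and last value offsets read in the snippet above; the sketch only shows the slicing and offset rebasing.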

Hence, my question: how do I properly extract the data and offsets out of such a column? A more generic version of this is: how do I extract the data out of ChunkedArrays with multiple chunks?


