Posted to jira@arrow.apache.org by "Hao Zou (Jira)" <ji...@apache.org> on 2022/01/20 08:41:00 UTC

[jira] (ARROW-15289) Support mutable array.

    [ https://issues.apache.org/jira/browse/ARROW-15289 ]


    Hao Zou deleted comment on ARROW-15289:
    ---------------------------------

was (Author: JIRAUSER283211):
h2. BACKGROUND

As arrow::Array objects are immutable, they are created through [arrow::ArrayBuilder or arrow::ArrayData|https://arrow.apache.org/docs/cpp/arrays.html#building-an-array]. However, some compute engines need to repeatedly read Arrow data from the same reader with a fixed schema, or repeatedly write Arrow data to the same writer with a fixed schema. In these scenarios an immutable arrow::Array adds considerable unnecessary overhead, because a new array has to be built each time. We therefore propose a mutable, reusable array with the following benefits:
 * It avoids the overhead of repeated array construction, which is significant with many columns or deeply nested types.
 * It avoids the memory fragmentation caused by repeatedly allocating and releasing memory.

h2. API
h3. MutableArray
 * *template <typename TYPE>* *MutableNumericArray*

// Return a mutable pointer to the raw values.
value_type* mutable_raw_values()
// Return a mutable pointer to the null bitmap.
uint8_t* mutable_null_bitmap_data()
// Change the array's reported size to the indicated size, allocating memory if necessary.
Status Resize(const int64_t new_nb_elements, bool shrink_to_fit = true)
// Ensure that the array has enough memory allocated to fit the indicated number of elements.
Status Reserve(const int64_t new_nb_elements)
*The APIs of MutableRecordBatch and MutableArrayBuilder are similar to that of MutableArray.*
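
To make the intended Resize/Reserve semantics concrete, here is a minimal standalone sketch. It does not use Arrow at all: SketchMutableInt64Array, length() and capacity() are hypothetical names, std::vector stands in for arrow::Buffer, and bool stands in for arrow::Status. It only illustrates one possible reading of the comments above, not an actual implementation.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for the proposed MutableNumericArray<Int64Type>.
// Assumptions: std::vector replaces arrow::Buffer, bool replaces arrow::Status.
class SketchMutableInt64Array {
 public:
  int64_t* mutable_raw_values() { return values_.data(); }
  uint8_t* mutable_null_bitmap_data() { return bitmap_.data(); }

  // Number of elements the array currently reports.
  int64_t length() const { return length_; }
  // Number of elements the allocated storage can hold.
  int64_t capacity() const { return static_cast<int64_t>(values_.size()); }

  // Grow the storage if needed; the reported length is left unchanged.
  bool Reserve(const int64_t new_nb_elements) {
    if (new_nb_elements < 0) return false;
    if (new_nb_elements > capacity()) Allocate(new_nb_elements);
    return true;
  }

  // Change the reported length, growing the storage if needed.
  // With shrink_to_fit == false, storage is kept so later batches can reuse it.
  bool Resize(const int64_t new_nb_elements, bool shrink_to_fit = true) {
    if (!Reserve(new_nb_elements)) return false;
    if (shrink_to_fit && new_nb_elements < capacity()) {
      Allocate(new_nb_elements);
      values_.shrink_to_fit();  // non-binding request to release memory
      bitmap_.shrink_to_fit();
    }
    length_ = new_nb_elements;
    return true;
  }

 private:
  // One int64 per element, one validity bit per element (rounded up to bytes).
  void Allocate(int64_t nb_elements) {
    values_.resize(static_cast<std::size_t>(nb_elements));
    bitmap_.resize(static_cast<std::size_t>((nb_elements + 7) / 8));
  }

  std::vector<int64_t> values_;
  std::vector<uint8_t> bitmap_;
  int64_t length_ = 0;
};
```

In this sketch, calling Resize(n, false) on an array that is already large enough keeps the existing allocation, which is the reuse behaviour the proposal is after.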
h2. Example

Modify the values of the int64 array in a mutable record batch.
// Create a mutable record batch with schema <c1: int64>
std::shared_ptr<arrow::MutableRecordBatch> mutable_record_batch = arrow::MutableRecordBatch::Make(schema, 2);

// Get the shared pointer of mutable array
auto int_array = std::dynamic_pointer_cast<arrow::MutableNumericArray<arrow::Int64Type>>(
  mutable_record_batch->column(0));

// Resize int_array to 1024 rows
int_array->Resize(1024);

// Get the pointer to mutable raw value
int64_t* raw_value = int_array->mutable_raw_values();

// Get the pointer to mutable null_bitmap
uint8_t* null_bitmap = int_array->mutable_null_bitmap_data();

// Modify the value of int64 array
for (int i = 0; i < 1024; ++i) {
  if (i % 2 == 0) {
    raw_value[i] = i;
    arrow::BitUtil::SetBit(null_bitmap, i);
  } else {
    arrow::BitUtil::ClearBit(null_bitmap, i);
  }
}

> Support mutable array.
> ----------------------
>
>                 Key: ARROW-15289
>                 URL: https://issues.apache.org/jira/browse/ARROW-15289
>             Project: Apache Arrow
>          Issue Type: New Feature
>          Components: C++
>         Environment: Linux version 3.10.0-327
>            Reporter: Hao Zou
>            Priority: Major
>              Labels: features
>   Original Estimate: 672h
>  Remaining Estimate: 672h
>
> In scenarios where a record batch needs to be reused, repeatedly constructing record batches is expensive. This task is about supporting a mutable record batch/array to avoid the repeated construction overhead.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)