Posted to jira@arrow.apache.org by "Neal Richardson (Jira)" <ji...@apache.org> on 2021/03/05 19:28:00 UTC
[jira] [Created] (ARROW-11878) [C++] Improve Converter API to support chunking
Neal Richardson created ARROW-11878:
---------------------------------------
Summary: [C++] Improve Converter API to support chunking
Key: ARROW-11878
URL: https://issues.apache.org/jira/browse/ARROW-11878
Project: Apache Arrow
Issue Type: New Feature
Components: C++
Reporter: Neal Richardson
Fix For: 4.0.0
We would like to be able to chunk a data frame when converting it to an Arrow Table in R (see ARROW-9293). Apparently this is also not supported in pyarrow.
[~romainfrancois] says two things need to happen:
- The Converter API needs to be able to Extend() a range of values. The current API only has the {{Status Extend(SEXP x, int64_t size)}} override, which says: ingest the vector x, and by the way it has this many elements.
- A Chunker, or perhaps a new class, would sit on top of that, so that {{Chunker::Extend(x)}} would call {{Converter$Extend(x, start, size)}} multiple times, once for each chunk.
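The two points above can be sketched roughly as follows. This is only an illustration under stated assumptions, not Arrow code: {{RangeConverter}} and {{FixedSizeChunker}} are hypothetical names, a {{std::vector<int>}} stands in for both the R vector and the resulting array, and the {{Extend(values, offset, size)}} signature is the proposed shape rather than an existing API.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for a converter; not Arrow's Converter class.
class RangeConverter {
 public:
  // Range-based Extend: ingest `size` values of `values`, starting at `offset`.
  void Extend(const std::vector<int>& values, int64_t offset, int64_t size) {
    assert(offset >= 0 && size >= 0 &&
           offset + size <= static_cast<int64_t>(values.size()));
    current_.insert(current_.end(), values.begin() + offset,
                    values.begin() + offset + size);
  }

  // Hand back the array built so far and start a new, empty one.
  std::vector<int> Finish() {
    std::vector<int> out;
    out.swap(current_);
    return out;
  }

 private:
  std::vector<int> current_;
};

// Hypothetical class sitting on top of the converter: it splits the input
// into chunks of at most `chunk_size` values, one range Extend per chunk.
class FixedSizeChunker {
 public:
  explicit FixedSizeChunker(int64_t chunk_size) : chunk_size_(chunk_size) {
    assert(chunk_size > 0);
  }

  void Extend(const std::vector<int>& values) {
    const int64_t total = static_cast<int64_t>(values.size());
    for (int64_t start = 0; start < total; start += chunk_size_) {
      // The last chunk may be shorter than chunk_size_.
      const int64_t size = std::min(chunk_size_, total - start);
      converter_.Extend(values, start, size);
      chunks_.push_back(converter_.Finish());
    }
  }

  const std::vector<std::vector<int>>& chunks() const { return chunks_; }

 private:
  int64_t chunk_size_;
  RangeConverter converter_;
  std::vector<std::vector<int>> chunks_;
};
```

With a chunk size of 3, extending with seven values would produce three chunks of lengths 3, 3 and 1. The point of the sketch is that the chunk boundaries are decided by the caller up front, instead of being discovered through capacity errors.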
The current chunker, I believe, solves a different problem: it is rooted in a Converter that deals with elements one by one, so that:
- if the element can be Append() that’s fine
- if not, then create a new chunk and try again
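The element-by-element behaviour described in this list can be sketched as below. This is a simplified illustration, not the Arrow implementation: {{CappedBuilder}} and {{ElementChunker}} are hypothetical names, the capacity is modelled as a byte budget on strings, and a {{bool}} return value stands in for a capacity-error {{Status}}.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical builder with a fixed byte budget; not an Arrow builder.
class CappedBuilder {
 public:
  explicit CappedBuilder(std::size_t capacity) : capacity_(capacity) {}

  // Returns false (standing in for a capacity error) if the value does not fit.
  bool Append(const std::string& value) {
    if (used_ + value.size() > capacity_) return false;
    used_ += value.size();
    values_.push_back(value);
    return true;
  }

  std::size_t length() const { return values_.size(); }

  // Hand back the values appended so far and reset the builder.
  std::vector<std::string> Finish() {
    std::vector<std::string> out;
    out.swap(values_);
    used_ = 0;
    return out;
  }

 private:
  std::size_t capacity_;
  std::size_t used_ = 0;
  std::vector<std::string> values_;
};

// Hypothetical chunker: try to append; on a capacity error, close the
// current chunk and retry once on a fresh one.
class ElementChunker {
 public:
  explicit ElementChunker(std::size_t capacity) : builder_(capacity) {}

  bool Append(const std::string& value) {
    if (builder_.Append(value)) return true;
    // An empty builder that still rejects the value can never accept it.
    if (builder_.length() == 0) return false;
    chunks_.push_back(builder_.Finish());
    return builder_.Append(value);
  }

  // Close the last chunk, if any, and return all chunks.
  std::vector<std::vector<std::string>> Finish() {
    if (builder_.length() > 0) chunks_.push_back(builder_.Finish());
    std::vector<std::vector<std::string>> out;
    out.swap(chunks_);
    return out;
  }

 private:
  CappedBuilder builder_;
  std::vector<std::vector<std::string>> chunks_;
};
```

Here a new chunk starts only when the builder refuses a value, so chunk sizes fall out of the capacity limit and cannot be requested by the caller.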
The current chunker has a multiple-element method, but it is all-or-nothing:
{code}
  // we could get bit smarter here since the whole batch of appendable values
  // will be rejected if a capacity error is raised
  Status Extend(InputType values, int64_t size) {
    auto status = converter_->Extend(values, size);
    if (ARROW_PREDICT_FALSE(status.IsCapacityError())) {
      if (converter_->builder()->length() == 0) {
        return status;
      }
      ARROW_RETURN_NOT_OK(FinishChunk());
      return Extend(values, size);
    }
    length_ += size;
    return status;
  }
{code}
This does not give a way to say, e.g., "take this vector and chunk it into arrays of this size", which is what we want.
cc [~kszucs] [~bkietz]