Posted to jira@arrow.apache.org by "Alessandro Molina (Jira)" <ji...@apache.org> on 2022/01/04 14:05:00 UTC
[jira] [Updated] (ARROW-11878) [C++] Improve Converter API to support chunking
[ https://issues.apache.org/jira/browse/ARROW-11878?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Alessandro Molina updated ARROW-11878:
--------------------------------------
Fix Version/s: 8.0.0
(was: 7.0.0)
> [C++] Improve Converter API to support chunking
> -----------------------------------------------
>
> Key: ARROW-11878
> URL: https://issues.apache.org/jira/browse/ARROW-11878
> Project: Apache Arrow
> Issue Type: New Feature
> Components: C++
> Reporter: Neal Richardson
> Priority: Major
> Fix For: 8.0.0
>
>
> We would like to be able to chunk a data frame when converting it to an Arrow Table in R (see ARROW-9293). Apparently this is not supported in pyarrow either.
> [~romainfrancois] says two things need to happen:
> - The Converter API needs to be able to Extend() over a range of values. The current API only has the {{Status Extend(SEXP x, int64_t size)}} override, which says "ingest this vector x, which has this many elements".
> - A Chunker, or perhaps a new class, would sit on top of that, so that {{Chunker::Extend(x)}} would call {{Converter$Extend(x, start, size)}} multiple times, once for each chunk.
> I believe the current chunker solves a different problem. It is rooted in a Converter that deals with elements one by one, so that:
> - if the element can be Append()ed, that’s fine
> - if not, a new chunk is created and the append is tried again
> The current chunker has a multiple-element method, but it is all or nothing:
> {code}
> // we could get bit smarter here since the whole batch of appendable values
> // will be rejected if a capacity error is raised
> Status Extend(InputType values, int64_t size) {
>   auto status = converter_->Extend(values, size);
>   if (ARROW_PREDICT_FALSE(status.IsCapacityError())) {
>     if (converter_->builder()->length() == 0) {
>       return status;
>     }
>     ARROW_RETURN_NOT_OK(FinishChunk());
>     return Extend(values, size);
>   }
>   length_ += size;
>   return status;
> }
> {code}
> This does not give a way to say, for example, "take this vector and chunk it into arrays of this size", which is what we want.
> cc [~kszucs] [~bkietz]
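The two-step proposal in the description (a range-based Extend() on the converter, plus a chunker that calls it once per chunk) might look roughly like the sketch below. It is not the Arrow API: {{RangeConverter}}, {{FixedSizeChunker}}, the {{bool}} return values (standing in for {{Status}}) and the use of {{std::vector<int>}} for both input and chunks are all assumptions made to keep the example self-contained.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for a Converter (not the Arrow class): appends a
// sub-range of the input to the chunk currently being built.
class RangeConverter {
 public:
  // Ingest values[start, start + size) instead of the whole vector.
  // Returns false if the requested range is out of bounds.
  bool Extend(const std::vector<int>& values, int64_t start, int64_t size) {
    const int64_t total = static_cast<int64_t>(values.size());
    if (start < 0 || size < 0 || start + size > total) {
      return false;
    }
    current_.insert(current_.end(), values.begin() + start,
                    values.begin() + start + size);
    return true;
  }

  // Hand back the finished chunk and reset for the next one.
  std::vector<int> Finish() {
    std::vector<int> out;
    out.swap(current_);
    return out;
  }

 private:
  std::vector<int> current_;
};

// Hypothetical chunker layered on top: it splits the input into chunks of
// at most chunk_size elements, issuing one range-based Extend() per chunk.
class FixedSizeChunker {
 public:
  explicit FixedSizeChunker(int64_t chunk_size) : chunk_size_(chunk_size) {}

  bool Extend(const std::vector<int>& values) {
    if (chunk_size_ <= 0) return false;
    const int64_t total = static_cast<int64_t>(values.size());
    for (int64_t start = 0; start < total; start += chunk_size_) {
      // The last chunk may be shorter than chunk_size_.
      const int64_t size = std::min(chunk_size_, total - start);
      if (!converter_.Extend(values, start, size)) return false;
      chunks_.push_back(converter_.Finish());
    }
    return true;
  }

  const std::vector<std::vector<int>>& chunks() const { return chunks_; }

 private:
  int64_t chunk_size_;
  RangeConverter converter_;
  std::vector<std::vector<int>> chunks_;
};
```

Unlike the all-or-nothing Extend() quoted above, the chunk boundaries here are chosen up front by the caller rather than discovered through a capacity error. A real implementation would presumably still need the capacity-error fallback within each chunk.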
--
This message was sent by Atlassian Jira
(v8.20.1#820001)