Posted to jira@arrow.apache.org by "Alessandro Molina (Jira)" <ji...@apache.org> on 2021/07/05 08:51:08 UTC

[jira] [Updated] (ARROW-11878) [C++] Improve Converter API to support chunking

     [ https://issues.apache.org/jira/browse/ARROW-11878?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Alessandro Molina updated ARROW-11878:
--------------------------------------
    Fix Version/s:     (was: 5.0.0)
                   6.0.0

> [C++] Improve Converter API to support chunking
> -----------------------------------------------
>
>                 Key: ARROW-11878
>                 URL: https://issues.apache.org/jira/browse/ARROW-11878
>             Project: Apache Arrow
>          Issue Type: New Feature
>          Components: C++
>            Reporter: Neal Richardson
>            Priority: Major
>             Fix For: 6.0.0
>
>
> We would like to be able to chunk a data frame when converting it to an Arrow Table in R (see ARROW-9293). Apparently pyarrow does not support this either.
> [~romainfrancois] says two things need to happen:
>  - The Converter API needs to be able to Extend() a range of values. The current API only offers the {{Status Extend(SEXP x, int64_t size)}} override, which says "ingest the whole vector x, which has this many elements".
>  - A Chunker, or perhaps a new class, would sit on top of that, so that {{Chunker::Extend(x)}} calls {{Converter$Extend(x, start, size)}} multiple times, once for each chunk.
> I believe the current chunker solves a different problem. It is rooted in a Converter that deals with elements one by one, so that:
>   - if the element can be appended with Append(), that’s fine
>   - if not, it creates a new chunk and tries again
> The current chunker does have a multiple-element method, but it is all or nothing:
> {code}
>   // we could get a bit smarter here since the whole batch of appendable values
>   // will be rejected if a capacity error is raised
>   Status Extend(InputType values, int64_t size) {
>     auto status = converter_->Extend(values, size);
>     if (ARROW_PREDICT_FALSE(status.IsCapacityError())) {
>       if (converter_->builder()->length() == 0) {
>         return status;
>       }
>       ARROW_RETURN_NOT_OK(FinishChunk());
>       return Extend(values, size);
>     }
>     length_ += size;
>     return status;
>   }
> {code}
> This does not provide a way to say, for example, "take this vector and chunk it into arrays of this size", which is what we want.
> cc [~kszucs] [~bkietz]
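
For illustration only, the two pieces described above (a range-based Extend() and a chunker that calls it once per chunk) might look roughly like the sketch below. This is not Arrow's Converter or Chunker API: RangeConverter, FixedSizeChunker, Finish(), chunks() and the bool return values are all hypothetical stand-ins, and std::vector<int> takes the place of both the R vector and the Arrow builder.

```cpp
#include <algorithm>
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical stand-in for a Converter whose Extend() accepts a range.
// Names and signatures are assumptions, not Arrow's actual API.
class RangeConverter {
 public:
  // Ingest values[start, start + size) into the chunk being built.
  // Returns false if the requested range is out of bounds.
  bool Extend(const std::vector<int>& values, int64_t start, int64_t size) {
    if (start < 0 || size < 0 ||
        start + size > static_cast<int64_t>(values.size())) {
      return false;
    }
    current_.insert(current_.end(), values.begin() + start,
                    values.begin() + start + size);
    return true;
  }

  // Hand over the chunk built so far and reset for the next one.
  std::vector<int> Finish() {
    std::vector<int> out = std::move(current_);
    current_.clear();
    return out;
  }

 private:
  std::vector<int> current_;
};

// Hypothetical chunker layered on top of the range-based converter:
// it splits the input into chunks of at most chunk_size elements.
class FixedSizeChunker {
 public:
  explicit FixedSizeChunker(int64_t chunk_size) : chunk_size_(chunk_size) {}

  // Calls RangeConverter::Extend() once per chunk.
  bool Extend(const std::vector<int>& values) {
    if (chunk_size_ <= 0) return false;
    const int64_t total = static_cast<int64_t>(values.size());
    for (int64_t start = 0; start < total; start += chunk_size_) {
      const int64_t size = std::min(chunk_size_, total - start);
      if (!converter_.Extend(values, start, size)) return false;
      chunks_.push_back(converter_.Finish());
    }
    return true;
  }

  const std::vector<std::vector<int>>& chunks() const { return chunks_; }

 private:
  int64_t chunk_size_;
  RangeConverter converter_;
  std::vector<std::vector<int>> chunks_;
};
```

With a chunk size of 3, extending with a seven-element vector would produce chunks of 3, 3 and 1 elements. A real implementation would presumably return arrow::Status and still have to handle capacity errors inside each chunk, which this sketch leaves out.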



--
This message was sent by Atlassian Jira
(v8.3.4#803005)