Posted to dev@arrow.apache.org by "Omer Ozarslan (Jira)" <ji...@apache.org> on 2019/08/28 17:02:00 UTC

[jira] [Created] (ARROW-6377) [C++] Extending STL API to support row-wise conversion

Omer Ozarslan created ARROW-6377:
------------------------------------

             Summary: [C++] Extending STL API to support row-wise conversion
                 Key: ARROW-6377
                 URL: https://issues.apache.org/jira/browse/ARROW-6377
             Project: Apache Arrow
          Issue Type: New Feature
          Components: C++
            Reporter: Omer Ozarslan


Using array builders is currently the way the documentation recommends for converting row-wise data to Arrow tables. However, array builders have a low-level interface so that they can support the library's various use cases. They require additional boilerplate due to type erasure, although some of this boilerplate could be avoided at compile time if the schema is already known and fixed (also discussed in ARROW-4067).

Elsewhere in the library, the STL API provides a nice abstraction over builders by inferring the data types and builders from the values provided, which reduces the boilerplate significantly. It currently handles converting tuples automatically for a limited set of native types: numeric types, string and vector (plus nullable variations of these if ARROW-6326 is merged). It also allows passing references in tuple values (implemented recently in ARROW-6284).

As a more concrete example, this is the code that can be used to convert the {{data_row}} rows provided in the examples:
  
{code:cpp}
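#include <memory>
#include <tuple>
#include <vector>

#include <arrow/api.h>
#include <arrow/stl.h>
#include <range/v3/view/transform.hpp>

// data_row is the struct from the examples (fields: id, cost, cost_components).
arrow::Status VectorToColumnarTableSTL(const std::vector<struct data_row>& rows,
                                       std::shared_ptr<arrow::Table>* table) {
    // View each row as a tuple; cost_components is passed by reference.
    auto rng = rows | ranges::views::transform([](const data_row& row) {
                   return std::tuple<int, double, const std::vector<double>&>(
                       row.id, row.cost, row.cost_components);
               });
    // Column types are inferred from the tuple element types.
    return arrow::stl::TableFromTupleRange(arrow::default_memory_pool(), rng,
                                           {"id", "cost", "cost_components"},
                                           table);
}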
{code}
So, it allows more concise code for consumers of the API compared to using builders directly.

The library has no direct support for other types (binary, struct, union, etc., or converting iterable objects other than vectors to lists). Users are provided a way to add specializations for their own data structures. One limitation of implicit inference is that in some cases it is hard (or even impossible) to infer the exact type to use. For example, should a {{std::string_view}} value be inferred as string, binary, large binary or list? This ambiguity can be avoided by giving the user some way to state explicitly the correct type for storing a column. For example, a user could return a so-called {{BinaryCell}} class to return binary values.
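
To illustrate the idea, here is a minimal sketch of how explicit cell types could remove the ambiguity. It uses only the standard library, not Arrow, and the names {{ColumnKind}}, {{KindOf}} and {{kind_of}}, as well as the layout of {{StringCell}} and {{BinaryCell}}, are assumptions made up for this example rather than an existing or proposed API:

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <string_view>

// Hypothetical stand-ins for Arrow logical types (not the library's classes).
enum class ColumnKind { Int8, Double, String, Binary };

// Hypothetical non-owning cells: the wrapper type states the column type.
// The caller must keep the viewed bytes alive.
struct StringCell { std::string_view value; };
struct BinaryCell { std::string_view value; };

// Compile-time mapping from a value type to a column kind. The primary
// template is left undefined, so an ambiguous type such as a bare
// std::string_view is a compile error instead of a guess.
template <typename T>
struct KindOf;

template <> struct KindOf<std::int8_t> {
  static constexpr ColumnKind value = ColumnKind::Int8;
};
template <> struct KindOf<double> {
  static constexpr ColumnKind value = ColumnKind::Double;
};
template <> struct KindOf<StringCell> {
  static constexpr ColumnKind value = ColumnKind::String;
};
template <> struct KindOf<BinaryCell> {
  static constexpr ColumnKind value = ColumnKind::Binary;
};

// Returns the column kind selected by the static type of the argument.
template <typename T>
constexpr ColumnKind kind_of(const T&) {
  return KindOf<T>::value;
}
```

In this sketch the same underlying bytes map to a string column when wrapped in a StringCell and to a binary column when wrapped in a BinaryCell, while primitives are selected by a plain cast such as int8_t(value).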

Proposed changes:
 * Implementing cell "adapters": Cells are non-owning references for each type. It is the user's responsibility to keep the referenced values alive. (Can scalars be used in this context?)
 ** BinaryCell
 ** StringCell
 ** ListCell (for adapting any Range)
 ** StructCell
 ** ...
 * Primitive types don't need such adapters since their values are trivial to cast (e.g. just use int8_t(value) to use Int8Type).
 * Adding benchmarks to compare against builder performance. There is likely to be some performance penalty from hindered compiler optimizations. Yet, IMHO, this is acceptable in exchange for more concise code. For fine-grained control over performance, it will still be possible to use builders directly.
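
As a rough sketch of what a non-owning ListCell over any range might look like, the following uses only the standard library. The class shape and the {{DoubleListColumn}} stand-in (flattened values plus offsets, loosely modeled on a list column) are assumptions for illustration, not Arrow code:

```cpp
#include <cassert>
#include <cstddef>
#include <iterator>
#include <list>
#include <vector>

// Hypothetical ListCell: a non-owning view over any container that works
// with std::begin/std::end. It stores only a pointer, so the container must
// outlive the cell (never construct it from a temporary).
template <typename Range>
class ListCell {
 public:
  explicit ListCell(const Range& range) : range_(&range) {}
  auto begin() const { return std::begin(*range_); }
  auto end() const { return std::end(*range_); }

 private:
  const Range* range_;
};

// Stand-in for a list<double> column: child values are stored contiguously
// and offsets mark where each appended list ends.
struct DoubleListColumn {
  std::vector<double> values;
  std::vector<std::size_t> offsets{0};

  // Copies the elements seen through the cell into the column.
  template <typename Range>
  void Append(const ListCell<Range>& cell) {
    for (const auto& v : cell) values.push_back(static_cast<double>(v));
    offsets.push_back(values.size());
  }
};
```

Because the cell only borrows the container, appending a std::vector and a std::list goes through the same code path without copying either container first.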

I have implemented something similar to BinaryCell for my use case. If the above changes sound reasonable, I will go ahead and start implementing the other cells to submit.


--
This message was sent by Atlassian Jira
(v8.3.2#803003)