Posted to jira@arrow.apache.org by "Laurent Querel (Jira)" <ji...@apache.org> on 2022/07/26 22:16:00 UTC

[jira] [Updated] (ARROW-17220) [proposal] Arrow Intermediate Representation to facilitate the transformation of row-oriented data sources into Arrow columnar representation

     [ https://issues.apache.org/jira/browse/ARROW-17220?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Laurent Querel updated ARROW-17220:
-----------------------------------
    Summary: [proposal] Arrow Intermediate Representation to facilitate the transformation of row-oriented data sources into Arrow columnar representation  (was: [Proposal] Arrow Intermediate Representation)

> [proposal] Arrow Intermediate Representation to facilitate the transformation of row-oriented data sources into Arrow columnar representation
> ---------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: ARROW-17220
>                 URL: https://issues.apache.org/jira/browse/ARROW-17220
>             Project: Apache Arrow
>          Issue Type: New Feature
>          Components: Integration
>            Reporter: Laurent Querel
>            Priority: Major
>
> In the context of this [OTEP|https://github.com/lquerel/oteps/blob/main/text/0156-columnar-encoding.md] (OpenTelemetry Enhancement Proposal), I developed an integration layer on top of Apache Arrow (Go and Rust) to {*}facilitate the translation of row-oriented data streams into an Arrow-based columnar representation{*}. In this particular case the goal was to translate all OpenTelemetry entities (metrics, logs, or traces) into Apache Arrow records. These entities can be quite complex, and their corresponding Arrow schemas must be defined on the fly. IMO, this approach is not specific to my particular needs and could be used in many other contexts where there is a need to simplify the integration between a row-oriented data source and Apache Arrow.
> I know that JSON can be used as a kind of intermediate representation in the context of Arrow. However, I think the current JSON integrations are insufficient to cover the most complex scenarios, e.g. support for most of the Arrow data types, various optimizations (string/binary dictionaries, multi-column sorting), batching, integration with Arrow IPC, and compression ratio optimization. The goal of this proposal is to progressively cover these gaps.
> I am looking to see if the community would be interested in such a contribution. Below are some additional details on the current implementation. All feedback is welcome.
> 10K ft overview of the current implementation:
>  # Developers convert their row-oriented stream into records based on the Arrow Intermediate Representation (AIR). At this stage the translation can be quite mechanical, but if needed developers can decide, for example, to translate a map into a struct if that makes sense for them. The current implementation supports the following Arrow data types: bool, all uints, all ints, all floats, string, binary, list of any supported type, and struct of any supported types. Additional Arrow types could be added progressively.
>  # The row-oriented record (i.e. AIR record) is then added to a RecordRepository. This repository first computes a schema signature and then routes the record to a RecordBatcher based on this signature.
>  # The RecordBatcher is responsible for collecting all the compatible AIR records and, upon request, building an Arrow Record representing a batch of compatible inputs. In the current implementation, the batcher can convert string columns to dictionaries based on a configuration. Another configuration allows evaluating which columns should be sorted to optimize the compression ratio. The same optimization process could be applied to binary columns.
>  # Steps 1 through 3 can be repeated on the same RecordRepository instance to build new sets of Arrow record batches. Subsequent iterations will be slightly faster due to the techniques used (e.g. object reuse, dictionary reuse and sorting, ...).
>  
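> To make step 2 concrete, here is a minimal, self-contained Go sketch of the schema-signature routing idea. It is not the actual AIR API: the Field/Record types and all method names below are hypothetical stand-ins invented for illustration (the real AIR records support nested lists, structs, etc.), and only the canonical-signature-to-batch routing technique itself is taken from the description above.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// Field is a hypothetical stand-in for one column value of an AIR record.
type Field struct {
	Name string
	Type string // e.g. "int64", "string"
}

// Record is a hypothetical row-oriented record: an ordered set of fields.
type Record struct {
	Fields []Field
}

// SignatureOf computes a canonical schema signature by sorting the
// "name:type" pairs, so field order in the source row does not matter.
func SignatureOf(r Record) string {
	pairs := make([]string, 0, len(r.Fields))
	for _, f := range r.Fields {
		pairs = append(pairs, f.Name+":"+f.Type)
	}
	sort.Strings(pairs)
	return strings.Join(pairs, ",")
}

// Repository routes incoming records into per-signature groups, playing the
// role the RecordRepository plays in front of the per-schema RecordBatchers.
type Repository struct {
	batches map[string][]Record
}

func NewRepository() *Repository {
	return &Repository{batches: map[string][]Record{}}
}

// Add computes the record's signature and appends it to the matching group.
func (rp *Repository) Add(r Record) {
	sig := SignatureOf(r)
	rp.batches[sig] = append(rp.batches[sig], r)
}

func main() {
	rp := NewRepository()
	rp.Add(Record{Fields: []Field{{"id", "int64"}, {"name", "string"}}})
	// Same schema, different field order: routed to the same group.
	rp.Add(Record{Fields: []Field{{"name", "string"}, {"id", "int64"}}})
	rp.Add(Record{Fields: []Field{{"value", "float64"}}})
	fmt.Println(len(rp.batches)) // prints 2
}
```

> In this sketch, each per-signature group is what a RecordBatcher would consume to build one Arrow Record batch of compatible inputs; sorting the name:type pairs is one simple way to make the signature insensitive to source field order.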



--
This message was sent by Atlassian Jira
(v8.20.10#820010)