Posted to issues@arrow.apache.org by "Wes McKinney (JIRA)" <ji...@apache.org> on 2018/09/03 23:38:00 UTC

[jira] [Commented] (ARROW-3164) [Java] Port Row Set abstraction from Drill to Arrow

    [ https://issues.apache.org/jira/browse/ARROW-3164?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16602492#comment-16602492 ] 

Wes McKinney commented on ARROW-3164:
-------------------------------------

Sounds like a useful initiative. We've already developed some rows-to-columns functionality in C++, and it would be great to expand beyond what we have now, particularly around creating neatly-sized record batches. It would be useful to be able to quickly convert to Protobuf or Avro-encoded row data, and back.

One minor point though:

> Arrow evolved from Apache Drill. 

This isn't quite accurate. Java code from Apache Drill formed the basis for the initial Java codebase in Apache Arrow. I wouldn't say that the project evolved from Apache Drill itself. The project was created by a confluence of open source projects wishing to define an open standard for in-memory columnar data as its first project, with the broader goal of creating reusable libraries for creating database-like systems ("the deconstructed database" we have been calling it). It happened that Drill's ValueVectors were already very close to the fully-shredded columnar model that the community desired, and they provided a good starting point. The scope of the project has evolved significantly in the meantime.

> [Java] Port Row Set abstraction from Drill to Arrow
> ---------------------------------------------------
>
>                 Key: ARROW-3164
>                 URL: https://issues.apache.org/jira/browse/ARROW-3164
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: Java
>            Reporter: Paul Rogers
>            Priority: Major
>
> Arrow is a great way to exchange data between systems. Somewhere in the process, however, data must be loaded into, and read out of, the Arrow vectors.
> Arrow evolved from Apache Drill. The Drill project created a "Row Set" abstraction that:
> * Provides a simple way to define the schema for a set of batches.
> * Loads data into vectors from row-oriented inputs.
> * Reads data out of vectors in row-oriented output.
> * Controls memory consumed by the record batch when loading data into vectors.
> * Ensures maximum usage of the allocated vector space when loading data into vectors.
> * Optionally handles projection when reading data from an input file into a set of vectors.
> * Optionally handles data conversion from input to vector formats.
> This mechanism is handy for any Java developer who produces or consumes Arrow vectors.
> Detailed information is available in [this wiki|https://github.com/paul-rogers/arrow/wiki], including a more detailed description of the motivation for this project, and an analysis of the work required to do the Drill-to-Arrow port.
> The code is in Java simply because Drill is written in Java. The same mechanisms can be ported to other languages if useful. Those ports would be separate future projects.
> The code will be placed in a new Java module which can be imported by projects that wish to use the code. Changes may be needed to expose items from the {{vector}} module; we'll tackle those issues if/when they occur.
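
The row-to-column loading described in the issue above, together with the "neatly-sized record batches" mentioned in the comment, can be illustrated with a minimal plain-Java sketch. This is not Drill's Row Set API or the Arrow Java vector API: the names `RowBatcher`, `ColumnBatch`, `addRow`, and `finish` are hypothetical, the two-column schema (an int `id` and a String `name`) is invented for illustration, and the batch limit is a simple row count, whereas the issue describes limiting by memory consumed.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Toy row-to-column loader. All names are hypothetical, not Drill or Arrow APIs. */
final class RowBatcher {

    /** One columnar batch: parallel arrays, one array per column. */
    static final class ColumnBatch {
        final int[] ids;
        final String[] names;

        ColumnBatch(int[] ids, String[] names) {
            this.ids = ids;
            this.names = names;
        }

        int rowCount() {
            return ids.length;
        }
    }

    private final int maxRows;          // assumed limit: rows per batch, not bytes
    private final int[] ids;            // column buffer for "id"
    private final String[] names;       // column buffer for "name"
    private int count = 0;              // rows currently buffered
    private final List<ColumnBatch> completed = new ArrayList<>();

    RowBatcher(int maxRows) {
        if (maxRows <= 0) {
            throw new IllegalArgumentException("maxRows must be positive");
        }
        this.maxRows = maxRows;
        this.ids = new int[maxRows];
        this.names = new String[maxRows];
    }

    /** Appends one row, writing each field into its column buffer. */
    void addRow(int id, String name) {
        ids[count] = id;
        names[count] = name;
        count++;
        if (count == maxRows) {
            flush(); // the batch is full, so emit it and start a new one
        }
    }

    /** Emits the buffered rows as a batch trimmed to the rows actually written. */
    private void flush() {
        if (count == 0) {
            return;
        }
        completed.add(new ColumnBatch(Arrays.copyOf(ids, count),
                                      Arrays.copyOf(names, count)));
        count = 0;
    }

    /** Flushes any partial batch and returns all completed batches. */
    List<ColumnBatch> finish() {
        flush();
        return completed;
    }
}
```

Under these assumptions, a caller would construct `new RowBatcher(n)`, call `addRow` once per input row, and then call `finish()` to collect the batches; every batch except possibly the last holds exactly `n` rows.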


