Posted to dev@arrow.apache.org by Neville Dipale <ne...@gmail.com> on 2019/01/25 18:48:21 UTC

[Format] [Rust] ChunkedArray, Column and Table

Hi Arrow developers,

I've been looking at the various language impls, and although a Table isn't
currently part of the spec, it seems to be implemented in CPP, Python, Go,
JS (and perhaps other languages).

Are there plans to standardise these and add them to the spec?

I'm asking because I'm working on a dataframe implementation for Rust (
https://github.com/nevi-me/rust-dataframe), and I've started trying to
implement columns and tables with the intention to upstream them if I get
them right.

Regards
Neville

Re: [Format] [Rust] ChunkedArray, Column and Table

Posted by Sebastien Binet <bi...@cern.ch>.
On Sun, Jan 27, 2019 at 1:08 PM Neville Dipale <ne...@gmail.com>
wrote:

> Hi Antoine,
>
> I've given your response some thought.
>
> I'm thinking more about the computational aspect of Arrow. I agree
> that for representing and sharing data, RecordBatches achieve the purpose.
>
> I came across ChunkedArray, Column and Table while trying to create a
> dataframe library in Rust. The other languages already benefit from having
> these three implemented, but for Rust I've had to try to create them myself.
> This is what led me to ask the question, because the various languages
> that I've seen so far seem to follow the same kind of standard for both
> the structure and the methods to create and interact with chunked arrays,
> columns, and tables.
>
> [1] Go Tables:
> https://github.com/apache/arrow/blob/master/go/arrow/array/table.go


there's also this WIP dataframe package being built on top of Arrow:
-  https://github.com/gonum/exp/pull/19

-s


> [2] CPP Tables:
> https://github.com/apache/arrow/blob/master/cpp/src/arrow/table.cc
> [3] JS Tables: https://github.com/apache/arrow/blob/master/js/src/table.ts
> [4] Ruby:
> https://github.com/apache/arrow/blob/master/ruby/red-arrow/lib/arrow/table.rb
> [5] Python, pyarrow.Table
>
> While going through the source, I didn't find anything for Java, and that's
> swayed me to think that maybe Tables don't need standardising as each
> implementation would likely implement them differently (or not implement
> them).
>
> Regards
> Neville
>
> On Fri, 25 Jan 2019 at 20:56, Antoine Pitrou <an...@python.org> wrote:
>
> >
> > Hello Neville,
> >
> > I don't know if Tables need standardizing.  Record Batches are part of
> > the spec (*), and they are the basic block for exchanging and sharing
> > tabular data.  Depending on your application, you might exchange a
> > stream of Record Batches, or a fixed-length sequence thereof (in which
> > case you have a "Table").
> >
> > (*) see https://arrow.apache.org/docs/metadata.html
> >
> > (reading that spec though, it's not obvious to me why the Record Batch
> > definition doesn't reference a Schema)
> >
> > Regards
> >
> > Antoine.
> >
> >
> > On 25/01/2019 at 19:48, Neville Dipale wrote:
> > > Hi Arrow developers,
> > >
> > > I've been looking at the various language impls, and although a Table isn't
> > > currently part of the spec, it seems to be implemented in CPP, Python, Go,
> > > JS (and perhaps other languages).
> > >
> > > Are there plans to standardise these and add them to the spec?
> > >
> > > I'm asking because I'm working on a dataframe implementation for Rust (
> > > https://github.com/nevi-me/rust-dataframe), and I've started trying to
> > > implement columns and tables with the intention to upstream them if I get
> > > them right.
> > >
> > > Regards
> > > Neville
> > >
> >
>

Re: [Format] [Rust] ChunkedArray, Column and Table

Posted by Wes McKinney <we...@gmail.com>.
Just to add my two cents:

The Arrow specification and the Flatbuffers files define a _binary
protocol_ for making data available at the contiguous record batch
level, either in-process or via some other address space (a memory-mapped
file, a socket payload / RPC message).

Chunked arrays and tables are semantic constructs and don't really
have much to do with the binary protocol. They have turned out to be
convenient programming constructs, so I don't necessarily think it's a
bad idea for e.g. Go, Rust, or JavaScript to copy these ideas. There
is no requirement to do this, though; these were just some ideas I had
about how to make working with in-memory datasets consisting of
multiple record batches a bit nicer. There may be some other
interfaces or abstractions created in the future in one of the other
languages that we could adopt later in C++.

BTW, what we do in C++ if we have an arrow::Table whose columns have
different chunking layouts is to split up the table into a sequence of
regularized record batches (see [1]); these could then be put on the
wire (e.g. using Flight / gRPC) or written to a shared memory segment
using the IPC stream or file protocol.

- Wes

[1]: https://github.com/apache/arrow/blob/master/cpp/src/arrow/table.h#L302
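
For anyone following along without the C++ source handy, here is a rough
sketch of that splitting logic in Rust. The types are made-up placeholders
(columns reduced to their chunk lengths), not the arrow crate's API or the
actual C++ code; it only shows how differing chunk layouts determine the
record batch boundaries.

// Given columns with different chunk layouts, split the table at every
// chunk boundary so each resulting record batch is contiguous in every
// column. Each inner Vec holds one column's chunk lengths; all columns
// are assumed to contain the same total number of rows.
fn batch_boundaries(columns: &[Vec<usize>]) -> Vec<usize> {
    let total: usize = columns[0].iter().sum();
    // (current chunk index, offset into that chunk) for every column.
    let mut pos: Vec<(usize, usize)> = vec![(0, 0); columns.len()];
    let mut boundaries = Vec::new();
    let mut absolute = 0;
    while absolute < total {
        // The next batch can extend no further than the nearest chunk end
        // among all columns.
        let mut step = usize::MAX;
        for (chunks, &(chunk, offset)) in columns.iter().zip(pos.iter()) {
            step = step.min(chunks[chunk] - offset);
        }
        absolute += step;
        boundaries.push(absolute);
        // Advance every column's cursor, moving to its next chunk when the
        // current one is exhausted.
        for (chunks, p) in columns.iter().zip(pos.iter_mut()) {
            p.1 += step;
            if absolute < total && p.1 == chunks[p.0] {
                p.0 += 1;
                p.1 = 0;
            }
        }
    }
    boundaries
}

fn main() {
    // Column "a" is chunked as [4, 4] and column "b" as [3, 5] (8 rows each),
    // so the table splits into batches of 3, 1 and 4 rows.
    let columns = vec![vec![4, 4], vec![3, 5]];
    println!("{:?}", batch_boundaries(&columns)); // prints [3, 4, 8]
}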

On Sun, Jan 27, 2019 at 9:46 AM Antoine Pitrou <an...@python.org> wrote:
>
>
> Hi Neville,
>
> On 27/01/2019 at 13:07, Neville Dipale wrote:
> > Hi Antoine,
> >
> > I've given your response some thought.
> >
> > I'm thinking more about the computational aspect of Arrow. I agree
> > that for representing and sharing data, RecordBatches achieve the purpose.
> >
> > I came across ChunkedArray, Column and Table while trying to create a
> > dataframe library in Rust. The other languages already benefit from having
> > these three implemented, but for Rust I've had to try to create them myself.
> > This is what led me to ask the question, because the various languages
> > that I've seen so far seem to follow the same kind of standard for both
> > the structure and the methods to create and interact with chunked arrays,
> > columns, and tables.
>
> What happened is probably that most non-C++ implementations took
> inspiration from the C++ implementation ;-)
>
> Arrow does not aim at standardizing APIs, just data structures.
> Personally (i.e. I do not claim to represent the views of the project
> here), it seems to me that standardizing APIs leads to suboptimal and
> cumbersome "largest common denominator" interfaces such as the DOM APIs
> for XML.
>
> Regards
>
> Antoine.

Re: [Format] [Rust] ChunkedArray, Column and Table

Posted by Antoine Pitrou <an...@python.org>.
Hi Neville,

On 27/01/2019 at 13:07, Neville Dipale wrote:
> Hi Antoine,
> 
> I've given your response some thought.
> 
> I'm thinking more about the computational aspect of Arrow. I agree
> that for representing and sharing data, RecordBatches achieve the purpose.
> 
> I came across ChunkedArray, Column and Table while trying to create a
> dataframe library in Rust. The other languages already benefit from having
> these three implemented, but for Rust I've had to try to create them myself.
> This is what led me to ask the question, because the various languages
> that I've seen so far seem to follow the same kind of standard for both
> the structure and the methods to create and interact with chunked arrays,
> columns, and tables.

What happened is probably that most non-C++ implementations took
inspiration from the C++ implementation ;-)

Arrow does not aim at standardizing APIs, just data structures.
Personally (i.e. I do not claim to represent the views of the project
here), it seems to me that standardizing APIs leads to suboptimal and
cumbersome "largest common denominator" interfaces such as the DOM APIs
for XML.

Regards

Antoine.

Re: [Format] [Rust] ChunkedArray, Column and Table

Posted by Neville Dipale <ne...@gmail.com>.
Hi Antoine,

I've given your response some thought.

I'm thinking more about the computational aspect of Arrow. I agree
that for representing and sharing data, RecordBatches achieve the purpose.

I came across ChunkedArray, Column and Table while trying to create a
dataframe library in Rust. The other languages already benefit from having
these three implemented, but for Rust I've had to try to create them myself.
This is what led me to ask the question, because the various languages
that I've seen so far seem to follow the same kind of standard for both
the structure and the methods to create and interact with chunked arrays,
columns, and tables.

[1] Go Tables:
https://github.com/apache/arrow/blob/master/go/arrow/array/table.go
[2] CPP Tables:
https://github.com/apache/arrow/blob/master/cpp/src/arrow/table.cc
[3] JS Tables: https://github.com/apache/arrow/blob/master/js/src/table.ts
[4] Ruby:
https://github.com/apache/arrow/blob/master/ruby/red-arrow/lib/arrow/table.rb
[5] Python, pyarrow.Table
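
For illustration, here is a rough Rust sketch of the common shape these
three constructs seem to take. The type and method names are my own
placeholders (with a plain Vec<i64> standing in for an Arrow array), not
the arrow crate's API or any of the implementations above:

use std::sync::Arc;

// Stand-in for an immutable Arrow array; a real implementation would hold
// something like a trait object, but a Vec keeps the sketch self-contained.
type ArrayRef = Arc<Vec<i64>>;

/// Several same-typed arrays treated as one logical array.
struct ChunkedArray {
    chunks: Vec<ArrayRef>,
}

impl ChunkedArray {
    fn len(&self) -> usize {
        self.chunks.iter().map(|c| c.len()).sum()
    }
    fn num_chunks(&self) -> usize {
        self.chunks.len()
    }
}

/// A named chunked array.
struct Column {
    name: String,
    data: ChunkedArray,
}

/// A collection of equal-length columns.
struct Table {
    columns: Vec<Column>,
}

impl Table {
    fn num_rows(&self) -> usize {
        self.columns.first().map_or(0, |c| c.data.len())
    }
    fn num_columns(&self) -> usize {
        self.columns.len()
    }
}

fn main() {
    let a = ChunkedArray {
        chunks: vec![Arc::new(vec![1, 2, 3]), Arc::new(vec![4, 5])],
    };
    let table = Table {
        columns: vec![Column { name: "a".into(), data: a }],
    };
    println!(
        "column {:?} has {} chunks; table: {} rows x {} columns",
        table.columns[0].name,
        table.columns[0].data.num_chunks(),
        table.num_rows(),
        table.num_columns()
    );
    // -> column "a" has 2 chunks; table: 5 rows x 1 columns
}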

While going through the source, I didn't find anything for Java, and that's
swayed me to think that maybe Tables don't need standardising as each
implementation would likely implement them differently (or not implement
them).

Regards
Neville

On Fri, 25 Jan 2019 at 20:56, Antoine Pitrou <an...@python.org> wrote:

>
> Hello Neville,
>
> I don't know if Tables need standardizing.  Record Batches are part of
> the spec (*), and they are the basic block for exchanging and sharing
> tabular data.  Depending on your application, you might exchange a
> stream of Record Batches, or a fixed-length sequence thereof (in which
> case you have a "Table").
>
> (*) see https://arrow.apache.org/docs/metadata.html
>
> (reading that spec though, it's not obvious to me why the Record Batch
> definition doesn't reference a Schema)
>
> Regards
>
> Antoine.
>
>
> On 25/01/2019 at 19:48, Neville Dipale wrote:
> > Hi Arrow developers,
> >
> > I've been looking at the various language impls, and although a Table isn't
> > currently part of the spec, it seems to be implemented in CPP, Python, Go,
> > JS (and perhaps other languages).
> >
> > Are there plans to standardise these and add them to the spec?
> >
> > I'm asking because I'm working on a dataframe implementation for Rust (
> > https://github.com/nevi-me/rust-dataframe), and I've started trying to
> > implement columns and tables with the intention to upstream them if I get
> > them right.
> >
> > Regards
> > Neville
> >
>

Re: [Format] [Rust] ChunkedArray, Column and Table

Posted by Antoine Pitrou <an...@python.org>.
Hello Neville,

I don't know if Tables need standardizing.  Record Batches are part of
the spec (*), and they are the basic block for exchanging and sharing
tabular data.  Depending on your application, you might exchange a
stream of Record Batches, or a fixed-length sequence thereof (in which
case you have a "Table").

(*) see https://arrow.apache.org/docs/metadata.html

(reading that spec though, it's not obvious to me why the Record Batch
definition doesn't reference a Schema)
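
To picture the distinction concretely, here is a minimal sketch with
placeholder Rust types (not any real implementation's API): a stream
yields record batches one at a time, while a "table" holds a fixed,
materialized sequence of them behind one schema.

// A record batch is a fixed number of rows sharing one schema; a table is
// just a schema plus a fixed-length sequence of such batches.
struct Schema {
    field_names: Vec<String>,
}

struct RecordBatch {
    num_rows: usize,
    // column buffers omitted for brevity
}

struct Table {
    schema: Schema,
    batches: Vec<RecordBatch>,
}

impl Table {
    fn num_rows(&self) -> usize {
        self.batches.iter().map(|b| b.num_rows).sum()
    }
}

fn main() {
    let table = Table {
        schema: Schema { field_names: vec!["a".into(), "b".into()] },
        batches: vec![RecordBatch { num_rows: 3 }, RecordBatch { num_rows: 5 }],
    };
    println!("{} fields, {} rows", table.schema.field_names.len(), table.num_rows());
    // -> 2 fields, 8 rows
}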

Regards

Antoine.


On 25/01/2019 at 19:48, Neville Dipale wrote:
> Hi Arrow developers,
> 
> I've been looking at the various language impls, and although a Table isn't
> currently part of the spec, it seems to be implemented in CPP, Python, Go,
> JS (and perhaps other languages).
> 
> Are there plans to standardise these and add them to the spec?
> 
> I'm asking because I'm working on a dataframe implementation for Rust (
> https://github.com/nevi-me/rust-dataframe), and I've started trying to
> implement columns and tables with the intention to upstream them if I get
> them right.
> 
> Regards
> Neville
>