Posted to dev@arrow.apache.org by Wes McKinney <we...@gmail.com> on 2019/04/02 19:07:00 UTC
Re: [DISCUSS] Format changes: process and requirements

I created

https://cwiki.apache.org/confluence/display/ARROW/Columnar+Format+1.0+Milestone

some time ago to try to track the status of different implementations
and various in-flight discussions about columnar format evolution. Can
some others take a look at that and perhaps update some sections?

I agree with requiring at least two complete implementations, which
means we already have a good amount of implementation shortfall (e.g.
delta dictionaries) to address.

On Mon, Mar 18, 2019 at 12:51 AM Paul Taylor <pt...@apache.org> wrote:
>
> Hi Jacques,
>
> > I think we should have two complete implementations. I don't think having
> > one feature in C# and Go and another in JavaScript and Rust does justice to
> > the project goals.
>
> Agree 100%. We may already be in this situation with the DictionaryBatch
> "isDelta" flag. I haven't checked the C++ in a while so I may be
> mistaken, but I think JS is the only impl with support for interleaved
> Dictionary/RecordBatches. It'd be good to put a process in place that
> helps avoid this in the future.
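
To make the "isDelta" behavior mentioned above concrete, here is a minimal sketch of a reader that accepts dictionary batches interleaved with record batches. The message shapes and the decodeStream function are hypothetical simplifications for illustration, not the ArrowJS API, and the sketch assumes that a delta batch appends to the dictionary with the same id while a non-delta batch installs its values as that dictionary.

```typescript
// Hypothetical, simplified message shapes; these are NOT the real ArrowJS types.
type DictionaryBatch = { kind: "dictionary"; id: number; isDelta: boolean; values: string[] };
type RecordBatch = { kind: "record"; dictionaryId: number; indices: number[] };
type Message = DictionaryBatch | RecordBatch;

// Decode a stream in which dictionary batches may arrive between record batches.
function decodeStream(messages: Message[]): string[][] {
  const dictionaries = new Map<number, string[]>();
  const decoded: string[][] = [];
  for (const msg of messages) {
    if (msg.kind === "dictionary") {
      const existing = dictionaries.get(msg.id);
      if (msg.isDelta && existing !== undefined) {
        // Delta: append the new values to the dictionary already seen for this id.
        dictionaries.set(msg.id, existing.concat(msg.values));
      } else {
        // Non-delta (or first batch for this id): install the values as the dictionary.
        dictionaries.set(msg.id, msg.values.slice());
      }
    } else {
      const dict = dictionaries.get(msg.dictionaryId);
      if (dict === undefined) {
        throw new Error("no dictionary with id " + msg.dictionaryId);
      }
      // Resolve each index against the dictionary as it stands at this point in the stream.
      decoded.push(
        msg.indices.map((i) => {
          const value = dict[i];
          if (value === undefined) {
            throw new Error("index " + i + " is out of range for dictionary " + msg.dictionaryId);
          }
          return value;
        })
      );
    }
  }
  return decoded;
}
```

The point of the sketch is that a record batch is decoded against whatever the dictionary contains when that batch is read, so a reader without delta support would fail on any index that refers to an appended value.
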
>
> > I think Java and C++ should always be complete. They are
> > the first two implementations. I believe they are the most complete and
> > broadly used/popular (C++ given Python & Pandas integration and Java via
> > Spark & Dremio).
>
> No argument here either, though I should mention that, with the
> exception of Tensor messages, the JS version is also feature-complete
> from the standpoint of the format.
>
> It's still early in terms of adoption, but we've seen some interest from
> the Vega, Jupyter, and Uber Deck.gl projects in either contributing to
> or integrating with ArrowJS.
>
> So while we're certainly not at the level of Spark or Pandas, we may be
> poised for wider adoption, and I'd request we take the JS implementation
> into account when making format changes. I'm happy to implement new
> features and update the integration tests as necessary.
>
> > Are there specific changes to format/ that have been merged that you
> > are concerned about that you feel need to be discussed separately?
>
> The thing that springs to mind is anything to do with 64-bit indexing,
> as recently discussed in the sparse matrix thread. IIRC none of the JS
> engines presently allow allocating buffers greater than 2GiB.
> Limitations in JS shouldn't block other implementations from moving
> ahead, but it would be good for the community to come to a consensus on
> guidance or workarounds for JS interop when we are in that sort of
> situation.
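
One possible form such guidance could take is a guard that a JS reader runs before allocating a buffer whose length arrives as a 64-bit value. The sketch below is illustrative only: the Int64Parts shape, the MAX_BUFFER_BYTES constant, and toAllocatableLength are hypothetical names, and the ceiling simply follows the 2GiB figure recalled above rather than any measured engine limit. It conservatively treats any length at or above 2GiB as unsupported.

```typescript
// Hypothetical representation of a 64-bit length as two 32-bit halves.
type Int64Parts = { low: number; high: number };

// Assumed per-buffer ceiling (2 GiB), taken from the figure quoted in the thread.
const MAX_BUFFER_BYTES = 0x80000000;

// Return the length as a plain number when it is safe to allocate under the
// assumed ceiling, or null when the reader should refuse or fall back.
function toAllocatableLength(length: Int64Parts): number | null {
  // Any non-zero high half means the value is negative or at least 2^32.
  if (length.high !== 0) {
    return null;
  }
  // Reinterpret the low half as an unsigned 32-bit integer.
  const low = length.low >>> 0;
  return low < MAX_BUFFER_BYTES ? low : null;
}
```

A reader could call this on every buffer length in a message and report a clear "too large for this runtime" error when it returns null, instead of failing inside the allocation itself.
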
>
> Thanks,
>
> Paul
>
>
> On 3/17/19 6:07 PM, Jacques Nadeau wrote:
> >> How about "at least two native implementations" instead of
> >> "Java and C++"? Now, we have multiple native
> >> implementations:
> >>
> > I think we should have two complete implementations. I don't think having
> > one feature in C# and Go and another in JavaScript and Rust does justice to
> > the project goals. I think Java and C++ should always be complete. They are
> > the first two implementations. I believe they are the most complete and
> > broadly used/popular (C++ given Python & Pandas integration and Java via
> > Spark & Dremio). This is a compromise between setting a high barrier for
> > creation of new features and making sure that we have validated things
> > across impls.
> >
> > Are there specific changes to format/ that have been merged that you
> > are concerned about that you feel need to be discussed separately?
> > There have been some changes related to serializing tensor metadata
> > that are clearly marked as experimental, and they also do not interact
> > with the columnar format.
> >
> > There are several things we've introduced over time that suffered from
> > this problem: alignment changes, dictionary encoding, union behavior,
> > interval behavior, tensors, unsigned integers, etc., for which we've
> > failed to make sure we have integration tests. I've meant to send this email for
> > months but saw a couple of recent proposed changes which made me feel like
> > we should discuss further.
> >