Posted to dev@arrow.apache.org by Wes McKinney <we...@gmail.com> on 2017/07/24 16:47:59 UTC

[DISCUSS] The road from Arrow 0.5.0 to 1.0.0

hi folks,

In recent discussions, since the Arrow memory format and metadata have
become reasonably stable, and we're more likely to add new data
types than change existing ones, we may consider making a 1.0.0
release to declare to the rest of the open source world that "Arrow is
open for business" and can be relied upon in production applications
(with some reasonable tolerance for library API changes from major
release to major release). I hope we can all agree that forward and
backward compatibility in the zero-copy wire format and metadata is
the most essential thing.

To that end, I'd like to collect ideas for what needs to be
accomplished in the project before we'd be comfortable making a 1.0.0
release. I think it would be a good show of project stability /
production-readiness to do this (with the caveat that the APIs will
continue to evolve).

The main things on my end are hardening the memory format and
integration tests for the remaining data types:

- Decimals
  - Lingering issues with 128-bit decimals
  - Need integration tests
- Fixed size list
  - Java has implemented, but not C++. Need integration tests
- Union
  - Two kinds of unions; Java only implements one. Need integration tests

Of these, Decimals need the most work, since the memory format still
needs to be specified. On Unions, we may decide not to implement the dense
variant and focus on integration testing the sparse variant. I don't
think this is going to be too much work, but it needs to get sorted
out so we don't have incomplete or under-tested parts of the
specification.
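
For anyone who hasn't looked at the two variants side by side, here's
a small worked sketch of how sparse and dense unions lay out the same
logical values [1, 2.5, 3]. This is Java with made-up example values,
following my reading of the current spec, not library code:

    public class UnionLayouts {
        public static void main(String[] args) {
            // Union<int32, float64>; one type id per slot: 0=int32, 1=float64.
            byte[] typeIds = {0, 1, 0};

            // Sparse union: every child array spans the full union length;
            // slots not selected by the type id are just undefined padding.
            int[]    sparseInts    = {1, 0, 3};       // slot 1 unused
            double[] sparseDoubles = {0.0, 2.5, 0.0}; // slots 0 and 2 unused

            // Dense union: each child holds only its own values, and an
            // offsets buffer says where each slot lives within its child.
            int[]    offsets      = {0, 0, 1};
            int[]    denseInts    = {1, 3};
            double[] denseDoubles = {2.5};

            int slot = 2; // resolve slot 2 in the dense layout
            System.out.println(typeIds[slot] == 0
                ? "int32 " + denseInts[offsets[slot]]     // prints "int32 3"
                : "float64 " + denseDoubles[offsets[slot]]);
        }
    }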

There are some other things being discussed, like a Map logical type,
but that (at least as currently proposed) won't require any disruptive
modifications to the metadata.

As far as the metadata and memory format are concerned, we would use
the Open/Closed principle to guide our efforts
(https://en.wikipedia.org/wiki/Open/closed_principle). For example, it
would be possible to add compression or encoding at the field level
without disrupting earlier versions of the software that lack these
features.
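
As an illustration of the open/closed idea, here's a minimal sketch
assuming a hypothetical optional "compression" entry in the field
metadata (no such field exists today). Data from older writers simply
omits the field, and readers cleanly reject codecs they don't know:

    public class CompressionDispatch {
        interface BufferDecoder { byte[] decode(byte[] raw); }

        static BufferDecoder decoderFor(String compression) {
            if (compression == null) { // field absent: data from an older writer
                return raw -> raw;     // pass buffers through untouched
            }
            switch (compression) {
                case "lz4": return CompressionDispatch::decodeLz4;
                default:
                    throw new UnsupportedOperationException(
                        "unknown compression: " + compression);
            }
        }

        static byte[] decodeLz4(byte[] raw) {
            throw new UnsupportedOperationException("stub; no real codec wired in");
        }

        public static void main(String[] args) {
            // Uncompressed (legacy) path works; an unknown codec would throw.
            System.out.println(decoderFor(null).decode(new byte[]{1, 2, 3}).length);
        }
    }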

In the event that we do need to change the metadata or memory format
in the future (which would probably be an extreme circumstance), we
have the option of increasing the MetadataVersion, which is one of the
first tags accompanying Arrow messages
(https://github.com/apache/arrow/blob/master/format/Schema.fbs#L22).
So if a library encounters a message version it does not support, it
can raise an appropriate exception.
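
To make that concrete, here's a rough Java sketch of a reader failing
fast on a version it doesn't know. This is not the actual Arrow Java
API; the enum constants are made up for illustration:

    import java.util.Objects;

    public class VersionGate {

        // Stand-in for the MetadataVersion enum in format/Schema.fbs;
        // the constant names here are assumed.
        enum MetadataVersion { V1, V2, V3 }

        // Highest version this hypothetical reader knows how to handle.
        static final MetadataVersion MAX_SUPPORTED = MetadataVersion.V2;

        static void checkReadable(MetadataVersion messageVersion) {
            Objects.requireNonNull(messageVersion, "messageVersion");
            // Enum compareTo orders by declaration, so later versions sort higher.
            if (messageVersion.compareTo(MAX_SUPPORTED) > 0) {
                throw new UnsupportedOperationException(
                    "message metadata version " + messageVersion
                        + " is newer than this reader supports (" + MAX_SUPPORTED + ")");
            }
        }

        public static void main(String[] args) {
            checkReadable(MetadataVersion.V1); // older data stays readable
            checkReadable(MetadataVersion.V3); // unknown future version: fail fast
        }
    }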

There are some other things that would be nice to prototype or
specify, like a REST protocol for exposing Arrow datasets in a
client-server model (sending Arrow record batches via REST HTTP
calls).
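
As a strawman for what that could look like, here's a sketch using
only the JDK's built-in HTTP server. The routes and payloads below
are assumptions for discussion; none of this is specified anywhere:

    import com.sun.net.httpserver.HttpExchange;
    import com.sun.net.httpserver.HttpServer;
    import java.io.IOException;
    import java.io.OutputStream;
    import java.net.InetSocketAddress;
    import java.nio.charset.StandardCharsets;

    public class ArrowRestSketch {
        public static void main(String[] args) throws IOException {
            HttpServer server = HttpServer.create(new InetSocketAddress(8080), 0);

            // GET /dataset/schema -> just the serialized schema metadata
            server.createContext("/dataset/schema",
                ex -> respond(ex, "<schema bytes would go here>"));

            // GET /dataset/batches -> record batches for an already-known schema
            server.createContext("/dataset/batches",
                ex -> respond(ex, "<record batch stream would go here>"));

            server.start();
        }

        static void respond(HttpExchange ex, String body) throws IOException {
            byte[] bytes = body.getBytes(StandardCharsets.UTF_8);
            ex.sendResponseHeaders(200, bytes.length);
            try (OutputStream os = ex.getResponseBody()) {
                os.write(bytes);
            }
        }
    }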

Anything else that would need to happen to move to a 1.x mainline for
development? One idea would be that if we need to make any breaking
changes, we would leap from 1.x to 2.0.0 and put the 1.x branches into
maintenance mode.

Thanks
Wes

Re: [DISCUSS] The road from Arrow 0.5.0 to 1.0.0

Posted by Leif Walsh <le...@gmail.com>.
I think Wes' idea that major versions indicate stability of the spec and
minor versions indicate stability of each implementation's API makes sense.
With that in mind, maybe before 1.0 of the spec we should just establish,
within each of the reference language implementations, a mechanism for
specifying in which minor version an API feature was introduced or
deprecated, and also within the reference implementations, ensure that they
have appropriate mechanisms (and tests) to deal with future spec (major)
versions. That is, the Java and C++ implementations at least should check
that when Arrow spec 2.0 comes out, they'll fail fast, transparently, and
as gracefully as possible when reading data written with spec 2.0.
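
One way that mechanism could look in the Java implementation; this is
a sketch of a possible convention (class and method names invented),
not something the project has today:

    public class VectorApi {

        /** Appends a value. @since 0.6.0 */
        public void append(int value) { /* ... */ }

        /**
         * Old spelling of {@link #append(int)}.
         * @deprecated since 0.6.0, scheduled for removal in 1.0.0
         */
        @Deprecated
        public void add(int value) { append(value); }
    }
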
On Thu, Jul 27, 2017 at 13:03 Julian Hyde <jh...@apache.org> wrote:

> Semantic versioning is a great tool, and we should use it as far as it
> goes, but not push it.
>
> I suggest that the Arrow specification should have a paragraph that
> states the level of maturity of each part of the API; and each
> implementation should have a paragraph that states which parts of the
> spec are implemented, and to what quality. A lot can be accomplished
> in one paragraph in terms of setting people's expectations.
>
> And since you mentioned the open-closed principle earlier, the
> robustness principle [1] should apply: be liberal in what you accept,
> conservative in what you do. An arrow library should (ideally) not
> fall over if it encounters a data structure that was experimental in a
> previous version and has recently been removed.
>
> Julian
>
> [1] https://en.wikipedia.org/wiki/Robustness_principle
>
>
> On Wed, Jul 26, 2017 at 12:30 PM, Wes McKinney <we...@gmail.com>
> wrote:
> > The combinatorics of code-level API stability are worrisome (with
> > already 5 different language APIs in the project) while the maturity
> > and development pace of different implementations may remain variable
> > for some time.
> >
> > There are two possible things we can communicate with some form of
> > major version number:
> >
> > * The Arrow specification (independent to implementation) is complete,
> > with more than one reference implementations proving to have
> > implemented it
> >
> > * The code is complete and stable
> >
> > The latter seems undesirable, at least on a 6 month horizon. I don't
> > think it should keep us from making a public statement that we've
> > hardened the Arrow format itself. Perhaps we need two kinds of major
> > versions for the project.
> >
> > The worry I have is that strict semantic versioning might prove
> > onerous to immature implementations. As a concrete example, suppose
> > that someone starts a Go implementation shortly after we've made a 1.0
> > release with integration tests for all the well-specified Arrow types.
> > After a couple of months, the Go developers need to make some breaking
> > API changes. Does that mean we need to bump the whole project to 2.x?
> > As more languages come into the fold, this could happen more and more
> > often. How would people interpret a fast escalating major version
> > number?
> >
> > I am curious how Avro or Thrift have addressed this issue.
> >
> > - Wes
> >
> > On Wed, Jul 26, 2017 at 3:13 PM, Julian Hyde <jh...@apache.org> wrote:
> >> I agree with all that. But semantic versioning only pertains to public
> APIs. So, for it to work, you need to declare what are your public APIs. If
> you don’t, people will make assumptions about what are your public APIs,
> and they may get it wrong.
> >>
> >> The ability to add experimental APIs (not subject to semantic
> versioning until they are officially declared public) will help the project
> evolve and stay relevant.
> >>
> >> Julian
> >>
> >>
> >>> On Jul 26, 2017, at 12:02 PM, Wes McKinney <we...@gmail.com>
> wrote:
> >>>
> >>> I see the semantic versioning like this:
> >>>
> >>> Major version: Format and Metadata stability
> >>> Minor version: API stability within fix versions
> >>> Fix version: Bug fixes
> >>>
> >>> So an API might be deprecated from 1.0.0 to 1.1.0, but we could not
> >>> make a breaking change to the memory format without increasing the
> >>> major version. We also have the added protection of a version enum in
> >>> the metadata
> >>>
> >>> https://github.com/apache/arrow/blob/master/format/Schema.fbs#L22
> >>>
> >>> On Wed, Jul 26, 2017 at 2:56 PM, Wes McKinney <we...@gmail.com>
> wrote:
> >>>> Given the nature of the Arrow project, where any number of different
> >>>> implementations will be in flux at any given time, claiming any sort
> >>>> of API stability at the code level across the whole project seems
> >>>> impossible any time soon.
> >>>>
> >>>> The important commitment of a 1.0 release is that the metadata and
> >>>> memory format is not changing (without a change in the major version
> >>>> number, i.e. Arrow 1.x.y to 2.x.y); so Arrow's "API" in a sense is the
> >>>> memory format and serialized metadata representation. That is, the
> >>>> files in
> >>>>
> >>>> https://github.com/apache/arrow/tree/master/format
> >>>>
> >>>> Having this kind of stability is really important so that if any
> >>>> systems know how to parse or emit Arrow 1.x data, but aren't
> >>>> necessarily using the libraries provided by the project, they can have
> >>>> some assurance that we aren't going to break the Flatbuffers or the
> >>>> arrangement of bytes in a record batch on the wire. If that makes
> >>>> sense.
> >>>>
> >>>> - Wes
> >>>>
> >>>> On Wed, Jul 26, 2017 at 2:35 PM, Julian Hyde <jh...@apache.org>
> wrote:
> >>>>> 1.0 is a Big Deal because, under semantic versioning, there is a
> commitment to not change public APIs. If it weren’t for that, 1.0 would
> have vague marketing connotations of robustness, adoption etc. but
> otherwise be no different from another release.
> >>>>>
> >>>>> So, if API and data format lifecycle and compatibility is the goal
> here, would it be useful to introduce explicit flags on API maturity? Call
> out which APIs are public, and therefore bound by the semantic versioning
> contract. This will also give Arrow some room to add experimental features
> after 1.0, and avoid calcification.
> >>>>>
> >>>>> Julian
> >>>>>
> >>>>>
> >>>>>
> >>>>>> On Jul 26, 2017, at 7:40 AM, Wes McKinney <we...@gmail.com>
> wrote:
> >>>>>>
> >>>>>> I created https://issues.apache.org/jira/browse/ARROW-1277 about
> >>>>>> integration testing remaining data types. We are so close to having
> >>>>>> everything tested and stable, we should push to complete these as
> soon
> >>>>>> as possible (save for Map, which has only just been added to the
> >>>>>> metadata)
> >>>>>>
> >>>>>> On Mon, Jul 24, 2017 at 5:35 PM, Wes McKinney <we...@gmail.com>
> wrote:
> >>>>>>> I agree those things would be nice to have. Hardening the memory
> >>>>>>> format details probably would not take longer than a month or so
> if we
> >>>>>>> were to focus in on it.
> >>>>>>>
> >>>>>>> Formalizing REST / RPC or IPC seems like it will be more work, or
> will
> >>>>>>> require a design period and then initial implementation. I think
> >>>>>>> having the streaming format implementations is a good start, but
> the
> >>>>>>> streams are a bit monolithic -- e.g. in REST you might want to
> request
> >>>>>>> metadata only, or only record batches given a known schema. We
> should
> >>>>>>> create a proposal document (Google docs?) for the community to
> comment
> >>>>>>> where we can iterate on requirements
> >>>>>>>
> >>>>>>> Separately, I'm interested in embedding Arrow streams in other
> >>>>>>> transport layers, like gRPC. The recent refactoring in C++ to make
> the
> >>>>>>> streams less monolithic was intended to help with that.
> >>>>>>>
> >>>>>>> - Wes
> >>>>>>>
> >>>>>>> On Mon, Jul 24, 2017 at 4:01 PM, Jacques Nadeau <
> jacques@apache.org> wrote:
> >>>>>>>> Top things on my list:
> >>>>>>>>
> >>>>>>>> - Formalize Arrow RPC and/or REST
> >>>>>>>> - Some reference transformation algorithms
> >>>>>>>> - Prototype IPC
-- 
Cheers,
Leif

Re: [DISCUSS] The road from Arrow 0.5.0 to 1.0.0

Posted by Julian Hyde <jh...@apache.org>.
Semantic versioning is a great tool, and we should use it as far as it
goes, but not push it.

I suggest that the Arrow specification should have a paragraph that
states the level of maturity of each part of the API; and each
implementation should have a paragraph that states which parts of the
spec are implemented, and to what quality. A lot can be accomplished
in one paragraph in terms of setting people's expectations.

And since you mentioned the open-closed principle earlier, the
robustness principle [1] should apply: be liberal in what you accept,
conservative in what you do. An Arrow library should (ideally) not
fall over if it encounters a data structure that was experimental in a
previous version and has recently been removed.

Julian

[1] https://en.wikipedia.org/wiki/Robustness_principle
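
A sketch of what that tolerance might look like in practice, with type
ids and names invented for illustration: map unknown or since-removed
experimental types to an opaque placeholder instead of failing the
whole read:

    import java.util.HashMap;
    import java.util.Map;

    public class TolerantTypeTable {
        private static final Map<Integer, String> KNOWN = new HashMap<>();
        static {
            KNOWN.put(1, "Int32");
            KNOWN.put(2, "Float64");
        }

        static String describe(int typeId) {
            // Fall back to a placeholder rather than throwing, so one
            // unrecognized column can't take down the whole reader.
            return KNOWN.getOrDefault(typeId, "Unknown(type id " + typeId + ")");
        }

        public static void main(String[] args) {
            System.out.println(describe(2));  // Float64
            System.out.println(describe(99)); // Unknown(type id 99)
        }
    }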


Re: [DISCUSS] The road from Arrow 0.5.0 to 1.0.0

Posted by Wes McKinney <we...@gmail.com>.
The combinatorics of code-level API stability are worrisome (there are
already 5 different language APIs in the project), while the maturity
and development pace of different implementations may remain variable
for some time.

There are two possible things we can communicate with some form of
major version number:

* The Arrow specification (independent of implementation) is complete,
with more than one reference implementation proving to have
implemented it

* The code is complete and stable

The latter seems undesirable, at least on a 6 month horizon. I don't
think it should keep us from making a public statement that we've
hardened the Arrow format itself. Perhaps we need two kinds of major
versions for the project.

The worry I have is that strict semantic versioning might prove
onerous to immature implementations. As a concrete example, suppose
that someone starts a Go implementation shortly after we've made a 1.0
release with integration tests for all the well-specified Arrow types.
After a couple of months, the Go developers need to make some breaking
API changes. Does that mean we need to bump the whole project to 2.x?
As more languages come into the fold, this could happen more and more
often. How would people interpret a fast escalating major version
number?

I am curious how Avro or Thrift have addressed this issue.

- Wes

Re: [DISCUSS] The road from Arrow 0.5.0 to 1.0.0

Posted by Julian Hyde <jh...@apache.org>.
I agree with all that. But semantic versioning only pertains to public APIs. So, for it to work, you need to declare what your public APIs are. If you don’t, people will make assumptions about what they are, and they may get it wrong.

The ability to add experimental APIs (not subject to semantic versioning until they are officially declared public) will help the project evolve and stay relevant.
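
A sketch of one way to flag those APIs in Java; the annotation is
hypothetical, not something the project ships:

    import java.lang.annotation.Documented;
    import java.lang.annotation.ElementType;
    import java.lang.annotation.Retention;
    import java.lang.annotation.RetentionPolicy;
    import java.lang.annotation.Target;

    // Anything carrying this marker is explicitly outside the
    // semantic versioning contract and may change in any release.
    @Documented
    @Retention(RetentionPolicy.CLASS)
    @Target({ElementType.TYPE, ElementType.METHOD, ElementType.CONSTRUCTOR})
    public @interface Experimental {
        /** Free-form note, e.g. the release the API first appeared in. */
        String since() default "";
    }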

Julian


Re: [DISCUSS] The road from Arrow 0.5.0 to 1.0.0

Posted by Wes McKinney <we...@gmail.com>.
I see the semantic versioning like this:

Major version: Format and Metadata stability
Minor version: API stability across fix versions
Fix version: Bug fixes

So an API might be deprecated from 1.0.0 to 1.1.0, but we could not
make a breaking change to the memory format without increasing the
major version. We also have the added protection of a version enum in
the metadata

https://github.com/apache/arrow/blob/master/format/Schema.fbs#L22
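
Under that scheme, a consumer deciding wire compatibility would only
need to compare major versions; minor and fix differences are API-only.
A sketch (the scheme itself is the only assumption):

    public class SemverCheck {
        static boolean wireCompatible(String producerVersion, String consumerVersion) {
            // Only a major version mismatch implies a format/metadata break.
            return major(producerVersion) == major(consumerVersion);
        }

        static int major(String version) {
            return Integer.parseInt(version.split("\\.")[0]);
        }

        public static void main(String[] args) {
            System.out.println(wireCompatible("1.0.0", "1.3.2")); // true
            System.out.println(wireCompatible("1.4.0", "2.0.0")); // false
        }
    }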

Re: [DISCUSS] The road from Arrow 0.5.0 to 1.0.0

Posted by Wes McKinney <we...@gmail.com>.
Yes, definitely, sorry to not make that more clear. As part of this
process we should draw up a documentation page about how to interpret
the version numbers as a third party user, and how we will handle
documenting experimental features. For example, we might add an
experimental new logical type and decide after a few minor versions
that we need to change its memory representation.

Re: [DISCUSS] The road from Arrow 0.5.0 to 1.0.0

Posted by Julian Hyde <jh...@apache.org>.
It sounds as if you agree with me: It is very important that we clearly state which bits of Arrow are fixed and which bits are not.



Re: [DISCUSS] The road from Arrow 0.5.0 to 1.0.0

Posted by Wes McKinney <we...@gmail.com>.
Given the nature of the Arrow project, where any number of different
implementations will be in flux at any given time, claiming any sort
of API stability at the code level across the whole project seems
impossible any time soon.

The important commitment of a 1.0 release is that the metadata and
memory format are not changing (without a change in the major version
number, i.e. Arrow 1.x.y to 2.x.y); so Arrow's "API" in a sense is the
memory format and serialized metadata representation. That is, the
files in

https://github.com/apache/arrow/tree/master/format

Having this kind of stability is really important so that any systems
that know how to parse or emit Arrow 1.x data, but aren't necessarily
using the libraries provided by the project, can have some assurance
that we aren't going to break the Flatbuffers metadata or the
arrangement of bytes in a record batch on the wire.
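
For example, a consumer could gate on the MetadataVersion field
defined in those files before touching any buffers. A minimal sketch
in Python -- the supported-version set and function name here are
illustrative, not an existing Arrow API:

    SUPPORTED_METADATA_VERSIONS = {3}  # whichever versions 1.x emits

    def check_metadata_version(version):
        # Refuse messages whose MetadataVersion we don't understand,
        # rather than risk misreading the record batch bytes.
        if version not in SUPPORTED_METADATA_VERSIONS:
            raise NotImplementedError(
                "Unsupported Arrow MetadataVersion: %r" % (version,))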

- Wes


Re: [DISCUSS] The road from Arrow 0.5.0 to 1.0.0

Posted by Julian Hyde <jh...@apache.org>.
1.0 is a Big Deal because, under semantic versioning, there is a commitment not to change public APIs. If it weren’t for that, 1.0 would have vague marketing connotations of robustness, adoption, etc., but otherwise be no different from any other release.

So, if API and data format lifecycle and compatibility are the goal here, would it be useful to introduce explicit flags on API maturity? Call out which APIs are public, and therefore bound by the semantic versioning contract. This would also give Arrow some room to add experimental features after 1.0, and avoid calcification.
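
One possible shape for such a flag, sketched in Python -- the
decorator and the API it wraps are hypothetical, not existing Arrow
code; anything left unmarked would be bound by the semver contract:

    import functools
    import warnings

    def experimental(func):
        """Flag an API as exempt from the semantic versioning contract."""
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            warnings.warn(func.__name__ + " is experimental and may "
                          "change in any release", FutureWarning)
            return func(*args, **kwargs)
        return wrapper

    @experimental
    def stream_over_rest(dataset):  # hypothetical post-1.0 feature
        ...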

Julian





Re: [DISCUSS] The road from Arrow 0.5.0 to 1.0.0

Posted by Wes McKinney <we...@gmail.com>.
I created https://issues.apache.org/jira/browse/ARROW-1277 about
integration testing the remaining data types. We are so close to
having everything tested and stable that we should push to complete
these as soon as possible (save for Map, which has only just been
added to the metadata).
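
The integration harness drives the implementations from
language-independent JSON files, so covering a new type mostly means
generating fixtures for it and making each side validate them. A rough
sketch of such a fixture in Python -- the key names here are from
memory and should be treated as illustrative, not as the harness's
exact format:

    import json

    fixture = {
        "schema": {"fields": [{"name": "f0", "nullable": True,
                               "type": {"name": "int", "bitWidth": 32,
                                        "isSigned": True},
                               "children": []}]},
        "batches": [{"count": 3,
                     "columns": [{"name": "f0", "count": 3,
                                  "VALIDITY": [1, 0, 1],
                                  "DATA": [1, 0, 3]}]}],
    }
    with open("generated_int32.json", "w") as f:
        json.dump(fixture, f)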


Re: [DISCUSS] The road from Arrow 0.5.0 to 1.0.0

Posted by Wes McKinney <we...@gmail.com>.
I agree those things would be nice to have. Hardening the memory
format details probably would not take longer than a month or so if we
were to focus on it.

Formalizing REST / RPC or IPC seems like it will be more work, or will
require a design period and then an initial implementation. I think
having the streaming format implementations is a good start, but the
streams are a bit monolithic -- e.g. in REST you might want to request
metadata only, or only record batches given a known schema. We should
create a proposal document (Google Docs?) for the community to comment
on, where we can iterate on requirements.
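
As a strawman for that document, the two kinds of requests could map
to separate endpoints. A hypothetical sketch using Flask and pyarrow
-- the endpoint names are invented, and the pyarrow calls are just
today's Python API, nothing settled here:

    import pyarrow as pa
    from flask import Flask, Response

    app = Flask(__name__)

    # Toy in-memory dataset standing in for a real data source
    SCHEMA = pa.schema([("id", pa.int64()), ("value", pa.float64())])
    BATCH = pa.record_batch([pa.array([1, 2]), pa.array([0.5, 1.5])],
                            schema=SCHEMA)

    @app.route("/dataset/schema")
    def get_schema():
        # Metadata-only request: just the serialized schema
        return Response(SCHEMA.serialize().to_pybytes(),
                        mimetype="application/octet-stream")

    @app.route("/dataset/batches")
    def get_batches():
        # Record batches in the Arrow streaming format
        sink = pa.BufferOutputStream()
        with pa.ipc.new_stream(sink, SCHEMA) as writer:
            writer.write_batch(BATCH)
        return Response(sink.getvalue().to_pybytes(),
                        mimetype="application/octet-stream")

    if __name__ == "__main__":
        app.run()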

Separately, I'm interested in embedding Arrow streams in other
transport layers, like GRPC. The recent refactoring in C++ to make the
streams less monolithic was intended to help with that.

- Wes


Re: [DISCUSS] The road from Arrow 0.5.0 to 1.0.0

Posted by Jacques Nadeau <ja...@apache.org>.
Top things on my list:

- Formalize Arrow RPC and/or REST
- Some reference transformation algorithms (a sketch follows below)
- Prototype IPC
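
On the second of those, a toy example of the kind of transformation a
reference implementation could pin down -- a filter, sketched in
Python over pyarrow arrays; a real kernel would operate on the
validity and data buffers directly, and the function name is made up:

    import pyarrow as pa

    def filter_array(values, mask):
        # Keep values[i] wherever mask[i] is true (null mask slots drop)
        kept = [v.as_py() for v, m in zip(values, mask) if m.as_py()]
        return pa.array(kept, type=values.type)

    col = pa.array([1, 2, 3, 4])
    mask = pa.array([True, False, True, False])
    print(filter_array(col, mask))  # the filtered array: values 1 and 3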
