Posted to dev@impala.apache.org by Wes McKinney <we...@gmail.com> on 2017/02/25 22:18:36 UTC

[DISCUSS] C++ code sharing amongst Apache {Arrow, Kudu, Impala, Parquet}

Dear Apache Kudu and Apache Impala (incubating) communities,

(I'm not sure the best way to have a cross-list discussion, so I
apologize if this does not work well)

On the recent Apache Parquet sync call, we discussed C++ code sharing
between the codebases in Apache Arrow and Apache Parquet, and
opportunities for more code sharing with Kudu and Impala as well.

As context:

* We have an RC out for the 1.0.0 release of apache-parquet-cpp, the
first C++ release within Apache Parquet. I got involved with this
project a little over a year ago and was faced with the unpleasant
decision to copy and paste a significant amount of code out of
Impala's codebase to bootstrap the project.

* In parallel, we began the Apache Arrow project, which is designed to
be a complementary library for file formats (like Parquet), storage
engines (like Kudu), and compute engines (like Impala and pandas).

* As Arrow and parquet-cpp matured, an increasing amount of code
overlap crept in around buffer memory management and IO
interfaces. We recently decided in PARQUET-818
(https://github.com/apache/parquet-cpp/commit/2154e873d5aa7280314189a2683fb1e12a590c02)
to remove some of the obvious code overlap in Parquet and make
libarrow.a/so a hard compile and link-time dependency for
libparquet.a/so.

* There is still quite a bit of code in parquet-cpp that would better
fit in Arrow: SIMD hash utilities, RLE encoding, dictionary encoding,
compression, bit utilities, and so forth. Much of this code originated
from Impala.

This brings me to my next set of points:

* parquet-cpp contains quite a bit of code that was extracted from
Impala. This is mostly self-contained in
https://github.com/apache/parquet-cpp/tree/master/src/parquet/util

* My understanding is that Kudu extracted certain computational
utilities from Impala in its early days, but these tools have likely
diverged as the needs of the projects have evolved.

Since all of these projects are quite different in their end goals
(runtime systems vs. libraries), touching code that is tightly coupled
to either Kudu or Impala's runtimes is probably not worth discussing.
However, I think there is a strong basis for collaboration on
computational utilities and vectorized array processing. Some obvious
areas that come to mind:

* SIMD utilities (for hashing or processing of preallocated contiguous memory)
* Array encoding utilities: RLE / Dictionary, etc.
* Bit manipulation (packing and unpacking, e.g. Daniel Lemire
contributed a patch to parquet-cpp around this)
* Date and time utilities
* Compression utilities

I hope the benefits are obvious: consolidating efforts on unit
testing, benchmarking, performance optimizations, continuous
integration, and platform compatibility.

Logistically speaking, one possible avenue might be to use Apache
Arrow as the place to assemble this code. Its third-party toolchain is
small, and it builds and installs quickly. It is intended as a library
whose headers are included and linked against by other applications.
(As an aside, I'm very interested in building optional support for
Arrow columnar messages into the Kudu client.)

The downside of code sharing, which may have prevented it so far, is
the logistics of coordinating ASF release cycles and keeping build
toolchains in sync. It's taken us the past year to stabilize the
design of Arrow for its intended use cases, so at this point if we
went down this road I would be OK with helping the community commit to
a regular release cadence that would be faster than Impala, Kudu, and
Parquet's respective release cadences. Since members of the Kudu and
Impala PMC are also on the Arrow PMC, I trust we would be able to
collaborate to each other's mutual benefit and success.

Note that Arrow does not throw C++ exceptions and follows the Google
C++ style guide to roughly the same extent as Kudu and Impala.
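
In place of exceptions, these codebases return status objects from
fallible operations. The real arrow::Status carries more states and
machinery; this minimal Status class and the ParseNonNegative helper
are hypothetical, just to illustrate the calling convention:

```cpp
#include <cassert>
#include <string>
#include <utility>

// Minimal sketch of a Status-return convention (not the actual
// arrow::Status implementation).
class Status {
 public:
  static Status OK() { return Status(); }
  static Status Invalid(std::string msg) { return Status(std::move(msg)); }
  bool ok() const { return msg_.empty(); }
  const std::string& message() const { return msg_; }

 private:
  Status() = default;
  explicit Status(std::string msg) : msg_(std::move(msg)) {}
  std::string msg_;
};

// Fallible operations return a Status and write results to an
// out-parameter instead of throwing.
Status ParseNonNegative(const std::string& s, int* out) {
  if (s.empty() || s[0] == '-' ||
      s.find_first_not_of("0123456789") != std::string::npos) {
    return Status::Invalid("not a non-negative integer: " + s);
  }
  *out = std::stoi(s);
  return Status::OK();
}
```

Callers check `.ok()` (or use a RETURN_NOT_OK-style macro) rather than
wrapping calls in try/catch, which keeps the convention uniform across
the projects.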

If this is something that either the Kudu or Impala communities would
like to pursue in earnest, I would be happy to work with you on next
steps. I would suggest that we start with something small so that we
can address the necessary build toolchain changes and develop a
workflow for moving code and tests around, a protocol for code reviews
(e.g. Gerrit), and a process for coordinating ASF releases.

Let me know what you think.

best
Wes

Re: [DISCUSS] C++ code sharing amongst Apache {Arrow, Kudu, Impala, Parquet}

Posted by Wes McKinney <we...@gmail.com>.
hi Miki,

No, I don't think so. APR is a portable C library. The code we are
talking about would be intended for use in C++11/14 projects like
Impala and Kudu (and Arrow and Parquet).

Wes

On Sun, Feb 26, 2017 at 1:58 PM, Miki Tebeka <mi...@gmail.com> wrote:
> Can't some (most) of it be added to APR <https://apr.apache.org/>?
>
> On Sun, Feb 26, 2017 at 8:12 PM, Wes McKinney <we...@gmail.com> wrote:
>
>> hi Henry,
>>
>> Thank you for these comments.
>>
>> I think having a kind of "Apache Commons for [Modern] C++" would be an
>> ideal (though perhaps initially more labor intensive) solution.
>> There's code in Arrow that I would move into this project if it
>> existed. I am happy to help make this happen if there is interest from
>> the Kudu and Impala communities. I am not sure logistically what would
>> be the most expedient way to establish the project, whether as an ASF
>> Incubator project or possibly as a new TLP that could be created by
>> spinning IP out of Apache Kudu.
>>
>> I'm interested to hear the opinions of others, and possible next steps.
>>
>> Thanks
>> Wes
>>
>> On Sun, Feb 26, 2017 at 2:12 AM, Henry Robinson <he...@apache.org> wrote:
>> > Thanks for bringing this up, Wes.
>> >
>> > On 25 February 2017 at 14:18, Wes McKinney <we...@gmail.com> wrote:
>> >
>> >> Dear Apache Kudu and Apache Impala (incubating) communities,
>> >>
>> >> (I'm not sure the best way to have a cross-list discussion, so I
>> >> apologize if this does not work well)
>> >>
>> >> On the recent Apache Parquet sync call, we discussed C++ code sharing
>> >> between the codebases in Apache Arrow and Apache Parquet, and
>> >> opportunities for more code sharing with Kudu and Impala as well.
>> >>
>> >> As context
>> >>
>> >> * We have an RC out for the 1.0.0 release of apache-parquet-cpp, the
>> >> first C++ release within Apache Parquet. I got involved with this
>> >> project a little over a year ago and was faced with the unpleasant
>> >> decision to copy and paste a significant amount of code out of
>> >> Impala's codebase to bootstrap the project.
>> >>
>> >> * In parallel, we begin the Apache Arrow project, which is designed to
>> >> be a complementary library for file formats (like Parquet), storage
>> >> engines (like Kudu), and compute engines (like Impala and pandas).
>> >>
>> >> * As Arrow and parquet-cpp matured, an increasing amount of code
>> >> overlap crept up surrounding buffer memory management and IO
>> >> interface. We recently decided in PARQUET-818
>> >> (https://github.com/apache/parquet-cpp/commit/
>> >> 2154e873d5aa7280314189a2683fb1e12a590c02)
>> >> to remove some of the obvious code overlap in Parquet and make
>> >> libarrow.a/so a hard compile and link-time dependency for
>> >> libparquet.a/so.
>> >>
>> >> * There is still quite a bit of code in parquet-cpp that would better
>> >> fit in Arrow: SIMD hash utilities, RLE encoding, dictionary encoding,
>> >> compression, bit utilities, and so forth. Much of this code originated
>> >> from Impala
>> >>
>> >> This brings me to a next set of points:
>> >>
>> >> * parquet-cpp contains quite a bit of code that was extracted from
>> >> Impala. This is mostly self-contained in
>> >> https://github.com/apache/parquet-cpp/tree/master/src/parquet/util
>> >>
>> >> * My understanding is that Kudu extracted certain computational
>> >> utilities from Impala in its early days, but these tools have likely
>> >> diverged as the needs of the projects have evolved.
>> >>
>> >> Since all of these projects are quite different in their end goals
>> >> (runtime systems vs. libraries), touching code that is tightly coupled
>> >> to either Kudu or Impala's runtimes is probably not worth discussing.
>> >> However, I think there is a strong basis for collaboration on
>> >> computational utilities and vectorized array processing. Some obvious
>> >> areas that come to mind:
>> >>
>> >> * SIMD utilities (for hashing or processing of preallocated contiguous
>> >> memory)
>> >> * Array encoding utilities: RLE / Dictionary, etc.
>> >> * Bit manipulation (packing and unpacking, e.g. Daniel Lemire
>> >> contributed a patch to parquet-cpp around this)
>> >> * Date and time utilities
>> >> * Compression utilities
>> >>
>> >
>> > Between Kudu and Impala (at least) there are many more opportunities for
>> > sharing. Threads, logging, metrics, concurrent primitives - the list is
>> > quite long.
>> >
>> >
>> >>
>> >> I hope the benefits are obvious: consolidating efforts on unit
>> >> testing, benchmarking, performance optimizations, continuous
>> >> integration, and platform compatibility.
>> >>
>> >> Logistically speaking, one possible avenue might be to use Apache
>> >> Arrow as the place to assemble this code. Its thirdparty toolchain is
>> >> small, and it builds and installs fast. It is intended as a library to
>> >> have its headers used and linked against other applications. (As an
>> >> aside, I'm very interested in building optional support for Arrow
>> >> columnar messages into the kudu client).
>> >>
>> >
>> > In principle I'm in favour of code sharing, and it seems very much in
>> > keeping with the Apache way. However, practically speaking I'm of the
>> > opinion that it only makes sense to house shared support code in a
>> > separate, dedicated project.
>> >
>> > Embedding the shared libraries in, e.g., Arrow naturally limits the scope
>> > of sharing to utilities that Arrow is interested in. It would make no
>> sense
>> > to add a threading library to Arrow if it was never used natively.
>> Muddying
>> > the waters of the project's charter seems likely to lead to user, and
>> > developer, confusion. Similarly, we should not necessarily couple Arrow's
>> > design goals to those it inherits from Kudu and Impala's source code.
>> >
>> > I think I'd rather see a new Apache project than re-use a current one for
>> > two independent purposes.
>> >
>> >
>> >>
>> >> The downside of code sharing, which may have prevented it so far, are
>> >> the logistics of coordinating ASF release cycles and keeping build
>> >> toolchains in sync. It's taken us the past year to stabilize the
>> >> design of Arrow for its intended use cases, so at this point if we
>> >> went down this road I would be OK with helping the community commit to
>> >> a regular release cadence that would be faster than Impala, Kudu, and
>> >> Parquet's respective release cadences. Since members of the Kudu and
>> >> Impala PMC are also on the Arrow PMC, I trust we would be able to
>> >> collaborate to each other's mutual benefit and success.
>> >>
>> >> Note that Arrow does not throw C++ exceptions and similarly follows
>> >> Google C++ style guide to the same extent at Kudu and Impala.
>> >>
>> >> If this is something that either the Kudu or Impala communities would
>> >> like to pursue in earnest, I would be happy to work with you on next
>> >> steps. I would suggest that we start with something small so that we
>> >> could address the necessary build toolchain changes, and develop a
>> >> workflow for moving around code and tests, a protocol for code reviews
>> >> (e.g. Gerrit), and coordinating ASF releases.
>> >>
>> >
>> > I think, if I'm reading this correctly, that you're assuming integration
>> > with the 'downstream' projects (e.g. Impala and Kudu) would be done via
>> > their toolchains. For something as fast moving as utility code - and
>> > critical, where you want the latency between adding a fix and including
>> it
>> > in your build to be ~0 - that's a non-starter to me, at least with how
>> the
>> > toolchains are currently realised.
>> >
>> > I'd rather have the source code directly imported into Impala's tree -
>> > whether by git submodule or other mechanism. That way the coupling is
>> > looser, and we can move more quickly. I think that's important to other
>> > projects as well.
>> >
>> > Henry
>> >
>> >
>> >
>> >>
>> >> Let me know what you think.
>> >>
>> >> best
>> >> Wes
>> >>
>>

Re: [DISCUSS] C++ code sharing amongst Apache {Arrow, Kudu, Impala, Parquet}

Posted by Wes McKinney <we...@gmail.com>.
hi Miki,

No, I don't think so. APR is a portable C library. The code we are
talking about would be intended for use in C++11/14 projects like
Impala and Kudu (and Arrow and Parquet).

Wes

On Sun, Feb 26, 2017 at 1:58 PM, Miki Tebeka <mi...@gmail.com> wrote:
> Can't some (most) of it be added to APR <https://apr.apache.org/>?
>
> On Sun, Feb 26, 2017 at 8:12 PM, Wes McKinney <we...@gmail.com> wrote:
>
>> hi Henry,
>>
>> Thank you for these comments.
>>
>> I think having a kind of "Apache Commons for [Modern] C++" would be an
>> ideal (though perhaps initially more labor intensive) solution.
>> There's code in Arrow that I would move into this project if it
>> existed. I am happy to help make this happen if there is interest from
>> the Kudu and Impala communities. I am not sure logistically what would
>> be the most expedient way to establish the project, whether as an ASF
>> Incubator project or possibly as a new TLP that could be created by
>> spinning IP out of Apache Kudu.
>>
>> I'm interested to hear the opinions of others, and possible next steps.
>>
>> Thanks
>> Wes
>>
>> On Sun, Feb 26, 2017 at 2:12 AM, Henry Robinson <he...@apache.org> wrote:
>> > Thanks for bringing this up, Wes.
>> >
>> > On 25 February 2017 at 14:18, Wes McKinney <we...@gmail.com> wrote:
>> >
>> >> Dear Apache Kudu and Apache Impala (incubating) communities,
>> >>
>> >> (I'm not sure the best way to have a cross-list discussion, so I
>> >> apologize if this does not work well)
>> >>
>> >> On the recent Apache Parquet sync call, we discussed C++ code sharing
>> >> between the codebases in Apache Arrow and Apache Parquet, and
>> >> opportunities for more code sharing with Kudu and Impala as well.
>> >>
>> >> As context
>> >>
>> >> * We have an RC out for the 1.0.0 release of apache-parquet-cpp, the
>> >> first C++ release within Apache Parquet. I got involved with this
>> >> project a little over a year ago and was faced with the unpleasant
>> >> decision to copy and paste a significant amount of code out of
>> >> Impala's codebase to bootstrap the project.
>> >>
>> >> * In parallel, we begin the Apache Arrow project, which is designed to
>> >> be a complementary library for file formats (like Parquet), storage
>> >> engines (like Kudu), and compute engines (like Impala and pandas).
>> >>
>> >> * As Arrow and parquet-cpp matured, an increasing amount of code
>> >> overlap crept up surrounding buffer memory management and IO
>> >> interface. We recently decided in PARQUET-818
>> >> (https://github.com/apache/parquet-cpp/commit/
>> >> 2154e873d5aa7280314189a2683fb1e12a590c02)
>> >> to remove some of the obvious code overlap in Parquet and make
>> >> libarrow.a/so a hard compile and link-time dependency for
>> >> libparquet.a/so.
>> >>
>> >> * There is still quite a bit of code in parquet-cpp that would better
>> >> fit in Arrow: SIMD hash utilities, RLE encoding, dictionary encoding,
>> >> compression, bit utilities, and so forth. Much of this code originated
>> >> from Impala
>> >>
>> >> This brings me to a next set of points:
>> >>
>> >> * parquet-cpp contains quite a bit of code that was extracted from
>> >> Impala. This is mostly self-contained in
>> >> https://github.com/apache/parquet-cpp/tree/master/src/parquet/util
>> >>
>> >> * My understanding is that Kudu extracted certain computational
>> >> utilities from Impala in its early days, but these tools have likely
>> >> diverged as the needs of the projects have evolved.
>> >>
>> >> Since all of these projects are quite different in their end goals
>> >> (runtime systems vs. libraries), touching code that is tightly coupled
>> >> to either Kudu or Impala's runtimes is probably not worth discussing.
>> >> However, I think there is a strong basis for collaboration on
>> >> computational utilities and vectorized array processing. Some obvious
>> >> areas that come to mind:
>> >>
>> >> * SIMD utilities (for hashing or processing of preallocated contiguous
>> >> memory)
>> >> * Array encoding utilities: RLE / Dictionary, etc.
>> >> * Bit manipulation (packing and unpacking, e.g. Daniel Lemire
>> >> contributed a patch to parquet-cpp around this)
>> >> * Date and time utilities
>> >> * Compression utilities
>> >>
>> >
>> > Between Kudu and Impala (at least) there are many more opportunities for
>> > sharing. Threads, logging, metrics, concurrent primitives - the list is
>> > quite long.
>> >
>> >
>> >>
>> >> I hope the benefits are obvious: consolidating efforts on unit
>> >> testing, benchmarking, performance optimizations, continuous
>> >> integration, and platform compatibility.
>> >>
>> >> Logistically speaking, one possible avenue might be to use Apache
>> >> Arrow as the place to assemble this code. Its thirdparty toolchain is
>> >> small, and it builds and installs fast. It is intended as a library to
>> >> have its headers used and linked against other applications. (As an
>> >> aside, I'm very interested in building optional support for Arrow
>> >> columnar messages into the kudu client).
>> >>
>> >
>> > In principle I'm in favour of code sharing, and it seems very much in
>> > keeping with the Apache way. However, practically speaking I'm of the
>> > opinion that it only makes sense to house shared support code in a
>> > separate, dedicated project.
>> >
>> > Embedding the shared libraries in, e.g., Arrow naturally limits the scope
>> > of sharing to utilities that Arrow is interested in. It would make no
>> sense
>> > to add a threading library to Arrow if it was never used natively.
>> Muddying
>> > the waters of the project's charter seems likely to lead to user, and
>> > developer, confusion. Similarly, we should not necessarily couple Arrow's
>> > design goals to those it inherits from Kudu and Impala's source code.
>> >
>> > I think I'd rather see a new Apache project than re-use a current one for
>> > two independent purposes.
>> >
>> >
>> >>
>> >> The downside of code sharing, which may have prevented it so far, are
>> >> the logistics of coordinating ASF release cycles and keeping build
>> >> toolchains in sync. It's taken us the past year to stabilize the
>> >> design of Arrow for its intended use cases, so at this point if we
>> >> went down this road I would be OK with helping the community commit to
>> >> a regular release cadence that would be faster than Impala, Kudu, and
>> >> Parquet's respective release cadences. Since members of the Kudu and
>> >> Impala PMC are also on the Arrow PMC, I trust we would be able to
>> >> collaborate to each other's mutual benefit and success.
>> >>
>> >> Note that Arrow does not throw C++ exceptions and similarly follows
>> >> Google C++ style guide to the same extent at Kudu and Impala.
>> >>
>> >> If this is something that either the Kudu or Impala communities would
>> >> like to pursue in earnest, I would be happy to work with you on next
>> >> steps. I would suggest that we start with something small so that we
>> >> could address the necessary build toolchain changes, and develop a
>> >> workflow for moving around code and tests, a protocol for code reviews
>> >> (e.g. Gerrit), and coordinating ASF releases.
>> >>
>> >
>> > I think, if I'm reading this correctly, that you're assuming integration
>> > with the 'downstream' projects (e.g. Impala and Kudu) would be done via
>> > their toolchains. For something as fast moving as utility code - and
>> > critical, where you want the latency between adding a fix and including
>> it
>> > in your build to be ~0 - that's a non-starter to me, at least with how
>> the
>> > toolchains are currently realised.
>> >
>> > I'd rather have the source code directly imported into Impala's tree -
>> > whether by git submodule or other mechanism. That way the coupling is
>> > looser, and we can move more quickly. I think that's important to other
>> > projects as well.
>> >
>> > Henry
>> >
>> >
>> >
>> >>
>> >> Let me know what you think.
>> >>
>> >> best
>> >> Wes
>> >>
>>

Re: [DISCUSS] C++ code sharing amongst Apache {Arrow, Kudu, Impala, Parquet}

Posted by Wes McKinney <we...@gmail.com>.
hi Miki,

No, I don't think so. APR is a portable C library. The code we are
talking about would be intended for use in C++11/14 projects like
Impala and Kudu (and Arrow and Parquet).

Wes

On Sun, Feb 26, 2017 at 1:58 PM, Miki Tebeka <mi...@gmail.com> wrote:
> Can't some (most) of it be added to APR <https://apr.apache.org/>?
>
> On Sun, Feb 26, 2017 at 8:12 PM, Wes McKinney <we...@gmail.com> wrote:
>
>> hi Henry,
>>
>> Thank you for these comments.
>>
>> I think having a kind of "Apache Commons for [Modern] C++" would be an
>> ideal (though perhaps initially more labor intensive) solution.
>> There's code in Arrow that I would move into this project if it
>> existed. I am happy to help make this happen if there is interest from
>> the Kudu and Impala communities. I am not sure logistically what would
>> be the most expedient way to establish the project, whether as an ASF
>> Incubator project or possibly as a new TLP that could be created by
>> spinning IP out of Apache Kudu.
>>
>> I'm interested to hear the opinions of others, and possible next steps.
>>
>> Thanks
>> Wes
>>
>> On Sun, Feb 26, 2017 at 2:12 AM, Henry Robinson <he...@apache.org> wrote:
>> > Thanks for bringing this up, Wes.
>> >
>> > On 25 February 2017 at 14:18, Wes McKinney <we...@gmail.com> wrote:
>> >
>> >> Dear Apache Kudu and Apache Impala (incubating) communities,
>> >>
>> >> (I'm not sure the best way to have a cross-list discussion, so I
>> >> apologize if this does not work well)
>> >>
>> >> On the recent Apache Parquet sync call, we discussed C++ code sharing
>> >> between the codebases in Apache Arrow and Apache Parquet, and
>> >> opportunities for more code sharing with Kudu and Impala as well.
>> >>
>> >> As context
>> >>
>> >> * We have an RC out for the 1.0.0 release of apache-parquet-cpp, the
>> >> first C++ release within Apache Parquet. I got involved with this
>> >> project a little over a year ago and was faced with the unpleasant
>> >> decision to copy and paste a significant amount of code out of
>> >> Impala's codebase to bootstrap the project.
>> >>
>> >> * In parallel, we begin the Apache Arrow project, which is designed to
>> >> be a complementary library for file formats (like Parquet), storage
>> >> engines (like Kudu), and compute engines (like Impala and pandas).
>> >>
>> >> * As Arrow and parquet-cpp matured, an increasing amount of code
>> >> overlap crept up surrounding buffer memory management and IO
>> >> interface. We recently decided in PARQUET-818
>> >> (https://github.com/apache/parquet-cpp/commit/
>> >> 2154e873d5aa7280314189a2683fb1e12a590c02)
>> >> to remove some of the obvious code overlap in Parquet and make
>> >> libarrow.a/so a hard compile and link-time dependency for
>> >> libparquet.a/so.
>> >>
>> >> * There is still quite a bit of code in parquet-cpp that would better
>> >> fit in Arrow: SIMD hash utilities, RLE encoding, dictionary encoding,
>> >> compression, bit utilities, and so forth. Much of this code originated
>> >> from Impala
>> >>
>> >> This brings me to a next set of points:
>> >>
>> >> * parquet-cpp contains quite a bit of code that was extracted from
>> >> Impala. This is mostly self-contained in
>> >> https://github.com/apache/parquet-cpp/tree/master/src/parquet/util
>> >>
>> >> * My understanding is that Kudu extracted certain computational
>> >> utilities from Impala in its early days, but these tools have likely
>> >> diverged as the needs of the projects have evolved.
>> >>
>> >> Since all of these projects are quite different in their end goals
>> >> (runtime systems vs. libraries), touching code that is tightly coupled
>> >> to either Kudu or Impala's runtimes is probably not worth discussing.
>> >> However, I think there is a strong basis for collaboration on
>> >> computational utilities and vectorized array processing. Some obvious
>> >> areas that come to mind:
>> >>
>> >> * SIMD utilities (for hashing or processing of preallocated contiguous
>> >> memory)
>> >> * Array encoding utilities: RLE / Dictionary, etc.
>> >> * Bit manipulation (packing and unpacking, e.g. Daniel Lemire
>> >> contributed a patch to parquet-cpp around this)
>> >> * Date and time utilities
>> >> * Compression utilities
>> >>
>> >
>> > Between Kudu and Impala (at least) there are many more opportunities for
>> > sharing. Threads, logging, metrics, concurrent primitives - the list is
>> > quite long.
>> >
>> >
>> >>
>> >> I hope the benefits are obvious: consolidating efforts on unit
>> >> testing, benchmarking, performance optimizations, continuous
>> >> integration, and platform compatibility.
>> >>
>> >> Logistically speaking, one possible avenue might be to use Apache
>> >> Arrow as the place to assemble this code. Its thirdparty toolchain is
>> >> small, and it builds and installs fast. It is intended as a library to
>> >> have its headers used and linked against other applications. (As an
>> >> aside, I'm very interested in building optional support for Arrow
>> >> columnar messages into the kudu client).
>> >>
>> >
>> > In principle I'm in favour of code sharing, and it seems very much in
>> > keeping with the Apache way. However, practically speaking I'm of the
>> > opinion that it only makes sense to house shared support code in a
>> > separate, dedicated project.
>> >
>> > Embedding the shared libraries in, e.g., Arrow naturally limits the scope
>> > of sharing to utilities that Arrow is interested in. It would make no
>> sense
>> > to add a threading library to Arrow if it was never used natively.
>> Muddying
>> > the waters of the project's charter seems likely to lead to user, and
>> > developer, confusion. Similarly, we should not necessarily couple Arrow's
>> > design goals to those it inherits from Kudu and Impala's source code.
>> >
>> > I think I'd rather see a new Apache project than re-use a current one for
>> > two independent purposes.
>> >
>> >
>> >>
>> >> The downside of code sharing, which may have prevented it so far, are
>> >> the logistics of coordinating ASF release cycles and keeping build
>> >> toolchains in sync. It's taken us the past year to stabilize the
>> >> design of Arrow for its intended use cases, so at this point if we
>> >> went down this road I would be OK with helping the community commit to
>> >> a regular release cadence that would be faster than Impala, Kudu, and
>> >> Parquet's respective release cadences. Since members of the Kudu and
>> >> Impala PMC are also on the Arrow PMC, I trust we would be able to
>> >> collaborate to each other's mutual benefit and success.
>> >>
>> >> Note that Arrow does not throw C++ exceptions and similarly follows
>> >> Google C++ style guide to the same extent at Kudu and Impala.
>> >>
>> >> If this is something that either the Kudu or Impala communities would
>> >> like to pursue in earnest, I would be happy to work with you on next
>> >> steps. I would suggest that we start with something small so that we
>> >> could address the necessary build toolchain changes, and develop a
>> >> workflow for moving around code and tests, a protocol for code reviews
>> >> (e.g. Gerrit), and coordinating ASF releases.
>> >>
>> >
>> > I think, if I'm reading this correctly, that you're assuming integration
>> > with the 'downstream' projects (e.g. Impala and Kudu) would be done via
>> > their toolchains. For something as fast moving as utility code - and
>> > critical, where you want the latency between adding a fix and including
>> it
>> > in your build to be ~0 - that's a non-starter to me, at least with how
>> the
>> > toolchains are currently realised.
>> >
>> > I'd rather have the source code directly imported into Impala's tree -
>> > whether by git submodule or other mechanism. That way the coupling is
>> > looser, and we can move more quickly. I think that's important to other
>> > projects as well.
>> >
>> > Henry
>> >
>> >
>> >
>> >>
>> >> Let me know what you think.
>> >>
>> >> best
>> >> Wes
>> >>
>>

Re: [DISCUSS] C++ code sharing amongst Apache {Arrow, Kudu, Impala, Parquet}

Posted by Wes McKinney <we...@gmail.com>.
hi Miki,

No, I don't think so. APR is a portable C library. The code we are
talking about would be intended for use in C++11/14 projects like
Impala and Kudu (and Arrow and Parquet).

Wes

On Sun, Feb 26, 2017 at 1:58 PM, Miki Tebeka <mi...@gmail.com> wrote:
> Can't some (most) of it be added to APR <https://apr.apache.org/>?
>
> On Sun, Feb 26, 2017 at 8:12 PM, Wes McKinney <we...@gmail.com> wrote:
>
>> hi Henry,
>>
>> Thank you for these comments.
>>
>> I think having a kind of "Apache Commons for [Modern] C++" would be an
>> ideal (though perhaps initially more labor intensive) solution.
>> There's code in Arrow that I would move into this project if it
>> existed. I am happy to help make this happen if there is interest from
>> the Kudu and Impala communities. I am not sure logistically what would
>> be the most expedient way to establish the project, whether as an ASF
>> Incubator project or possibly as a new TLP that could be created by
>> spinning IP out of Apache Kudu.
>>
>> I'm interested to hear the opinions of others, and possible next steps.
>>
>> Thanks
>> Wes
>>
>> On Sun, Feb 26, 2017 at 2:12 AM, Henry Robinson <he...@apache.org> wrote:
>> > Thanks for bringing this up, Wes.
>> >
>> > On 25 February 2017 at 14:18, Wes McKinney <we...@gmail.com> wrote:
>> >
>> >> Dear Apache Kudu and Apache Impala (incubating) communities,
>> >>
>> >> (I'm not sure the best way to have a cross-list discussion, so I
>> >> apologize if this does not work well)
>> >>
>> >> On the recent Apache Parquet sync call, we discussed C++ code sharing
>> >> between the codebases in Apache Arrow and Apache Parquet, and
>> >> opportunities for more code sharing with Kudu and Impala as well.
>> >>
>> >> As context
>> >>
>> >> * We have an RC out for the 1.0.0 release of apache-parquet-cpp, the
>> >> first C++ release within Apache Parquet. I got involved with this
>> >> project a little over a year ago and was faced with the unpleasant
>> >> decision to copy and paste a significant amount of code out of
>> >> Impala's codebase to bootstrap the project.
>> >>
>> >> * In parallel, we begin the Apache Arrow project, which is designed to
>> >> be a complementary library for file formats (like Parquet), storage
>> >> engines (like Kudu), and compute engines (like Impala and pandas).
>> >>
>> >> * As Arrow and parquet-cpp matured, an increasing amount of code
>> >> overlap crept up surrounding buffer memory management and IO
>> >> interface. We recently decided in PARQUET-818
>> >> (https://github.com/apache/parquet-cpp/commit/
>> >> 2154e873d5aa7280314189a2683fb1e12a590c02)
>> >> to remove some of the obvious code overlap in Parquet and make
>> >> libarrow.a/so a hard compile and link-time dependency for
>> >> libparquet.a/so.
>> >>
>> >> * There is still quite a bit of code in parquet-cpp that would better
>> >> fit in Arrow: SIMD hash utilities, RLE encoding, dictionary encoding,
>> >> compression, bit utilities, and so forth. Much of this code originated
>> >> from Impala
>> >>
>> >> This brings me to a next set of points:
>> >>
>> >> * parquet-cpp contains quite a bit of code that was extracted from
>> >> Impala. This is mostly self-contained in
>> >> https://github.com/apache/parquet-cpp/tree/master/src/parquet/util
>> >>
>> >> * My understanding is that Kudu extracted certain computational
>> >> utilities from Impala in its early days, but these tools have likely
>> >> diverged as the needs of the projects have evolved.
>> >>
>> >> Since all of these projects are quite different in their end goals
>> >> (runtime systems vs. libraries), touching code that is tightly coupled
>> >> to either Kudu or Impala's runtimes is probably not worth discussing.
>> >> However, I think there is a strong basis for collaboration on
>> >> computational utilities and vectorized array processing. Some obvious
>> >> areas that come to mind:
>> >>
>> >> * SIMD utilities (for hashing or processing of preallocated contiguous
>> >> memory)
>> >> * Array encoding utilities: RLE / Dictionary, etc.
>> >> * Bit manipulation (packing and unpacking, e.g. Daniel Lemire
>> >> contributed a patch to parquet-cpp around this)
>> >> * Date and time utilities
>> >> * Compression utilities
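
[Editor's note: to make the flavor of the encoding utilities above concrete, here is a toy run-length coder. It is an illustration only -- the actual Impala/parquet-cpp code implements a hybrid RLE/bit-packed format, and none of the names below come from those codebases.]

```cpp
#include <cstdint>
#include <utility>
#include <vector>

// Toy run-length coder: each run of repeated bytes is stored as a
// (value, count) pair. Illustration only -- the real Impala/parquet-cpp
// coder is a hybrid RLE/bit-packed format, not this simple scheme.
std::vector<std::pair<uint8_t, uint32_t>> RleEncode(
    const std::vector<uint8_t>& values) {
  std::vector<std::pair<uint8_t, uint32_t>> runs;
  for (uint8_t v : values) {
    if (!runs.empty() && runs.back().first == v) {
      ++runs.back().second;  // extend the current run
    } else {
      runs.emplace_back(v, 1);  // start a new run
    }
  }
  return runs;
}

std::vector<uint8_t> RleDecode(
    const std::vector<std::pair<uint8_t, uint32_t>>& runs) {
  std::vector<uint8_t> out;
  for (const auto& run : runs) {
    out.insert(out.end(), run.second, run.first);  // expand (value, count)
  }
  return out;
}
```

For example, encoding {7, 7, 7, 2, 2, 9} yields the runs (7,3), (2,2), (9,1), and RleDecode inverts the transformation.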
>> >>
>> >
>> > Between Kudu and Impala (at least) there are many more opportunities for
>> > sharing. Threads, logging, metrics, concurrent primitives - the list is
>> > quite long.
>> >
>> >
>> >>
>> >> I hope the benefits are obvious: consolidating efforts on unit
>> >> testing, benchmarking, performance optimizations, continuous
>> >> integration, and platform compatibility.
>> >>
>> >> Logistically speaking, one possible avenue might be to use Apache
>> >> Arrow as the place to assemble this code. Its thirdparty toolchain is
>> >> small, and it builds and installs fast. It is intended as a library to
>> >> have its headers used and linked against other applications. (As an
>> >> aside, I'm very interested in building optional support for Arrow
>> >> columnar messages into the kudu client).
>> >>
>> >
>> > In principle I'm in favour of code sharing, and it seems very much in
>> > keeping with the Apache way. However, practically speaking I'm of the
>> > opinion that it only makes sense to house shared support code in a
>> > separate, dedicated project.
>> >
>> > Embedding the shared libraries in, e.g., Arrow naturally limits the scope
>> > of sharing to utilities that Arrow is interested in. It would make no
>> sense
>> > to add a threading library to Arrow if it was never used natively.
>> Muddying
>> > the waters of the project's charter seems likely to lead to user, and
>> > developer, confusion. Similarly, we should not necessarily couple Arrow's
>> > design goals to those it inherits from Kudu and Impala's source code.
>> >
>> > I think I'd rather see a new Apache project than re-use a current one for
>> > two independent purposes.
>> >
>> >
>> >>
>> >> The downside of code sharing, which may have prevented it so far, is
>> >> the logistics of coordinating ASF release cycles and keeping build
>> >> toolchains in sync. It's taken us the past year to stabilize the
>> >> design of Arrow for its intended use cases, so at this point if we
>> >> went down this road I would be OK with helping the community commit to
>> >> a regular release cadence that would be faster than Impala, Kudu, and
>> >> Parquet's respective release cadences. Since members of the Kudu and
>> >> Impala PMC are also on the Arrow PMC, I trust we would be able to
>> >> collaborate to each other's mutual benefit and success.
>> >>
>> >> Note that Arrow does not throw C++ exceptions, and it follows the
>> >> Google C++ style guide to the same extent as Kudu and Impala.
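
[Editor's note: for readers less familiar with that convention, avoiding exceptions in these codebases generally means fallible functions report errors through a returned Status object, as Arrow, Kudu, and Impala each do with their own Status classes. Below is a minimal sketch of the idiom; it is not the real arrow::Status or kudu::Status API, and ParsePositive is a made-up example function.]

```cpp
#include <string>
#include <utility>

// Minimal sketch of the Status-return idiom used in place of C++
// exceptions. The real arrow::Status / kudu::Status classes are much
// richer (error codes, helper macros); this only shows the calling
// convention.
class Status {
 public:
  static Status OK() { return Status(""); }
  static Status Invalid(std::string msg) { return Status(std::move(msg)); }
  bool ok() const { return msg_.empty(); }
  const std::string& message() const { return msg_; }

 private:
  explicit Status(std::string msg) : msg_(std::move(msg)) {}
  std::string msg_;
};

// A fallible function reports errors through its return value rather
// than by throwing. (Integer overflow is ignored here for brevity.)
Status ParsePositive(const std::string& s, int* out) {
  if (s.empty()) return Status::Invalid("empty input");
  int value = 0;
  for (char c : s) {
    if (c < '0' || c > '9') return Status::Invalid("not a number: " + s);
    value = value * 10 + (c - '0');
  }
  if (value <= 0) return Status::Invalid("not positive: " + s);
  *out = value;
  return Status::OK();
}
```

Call sites then check .ok() on the result (or, in the real codebases, wrap the call in a RETURN_NOT_OK-style macro) instead of writing try/catch blocks.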
>> >>
>> >> If this is something that either the Kudu or Impala communities would
>> >> like to pursue in earnest, I would be happy to work with you on next
>> >> steps. I would suggest that we start with something small so that we
>> >> could address the necessary build toolchain changes, and develop a
>> >> workflow for moving around code and tests, a protocol for code reviews
>> >> (e.g. Gerrit), and coordinating ASF releases.
>> >>
>> >
>> > I think, if I'm reading this correctly, that you're assuming integration
>> > with the 'downstream' projects (e.g. Impala and Kudu) would be done via
>> > their toolchains. For something as fast moving as utility code - and
>> > critical, where you want the latency between adding a fix and including
>> it
>> > in your build to be ~0 - that's a non-starter to me, at least with how
>> the
>> > toolchains are currently realised.
>> >
>> > I'd rather have the source code directly imported into Impala's tree -
>> > whether by git submodule or other mechanism. That way the coupling is
>> > looser, and we can move more quickly. I think that's important to other
>> > projects as well.
>> >
>> > Henry
>> >
>> >
>> >
>> >>
>> >> Let me know what you think.
>> >>
>> >> best
>> >> Wes
>> >>
>>

Re: [DISCUSS] C++ code sharing amongst Apache {Arrow, Kudu, Impala, Parquet}

Posted by Miki Tebeka <mi...@gmail.com>.
Can't some (most) of it be added to APR <https://apr.apache.org/>?

On Sun, Feb 26, 2017 at 8:12 PM, Wes McKinney <we...@gmail.com> wrote:

> hi Henry,
>
> Thank you for these comments.
>
> I think having a kind of "Apache Commons for [Modern] C++" would be an
> ideal (though perhaps initially more labor intensive) solution.
> There's code in Arrow that I would move into this project if it
> existed. I am happy to help make this happen if there is interest from
> the Kudu and Impala communities. I am not sure logistically what would
> be the most expedient way to establish the project, whether as an ASF
> Incubator project or possibly as a new TLP that could be created by
> spinning IP out of Apache Kudu.
>
> I'm interested to hear the opinions of others, and possible next steps.
>
> Thanks
> Wes
>
> On Sun, Feb 26, 2017 at 2:12 AM, Henry Robinson <he...@apache.org> wrote:
> > [...]
>

Re: [DISCUSS] C++ code sharing amongst Apache {Arrow, Kudu, Impala, Parquet}

Posted by Julian Hyde <jh...@apache.org>.
Yes. Since an Apache project is a community, it is very easy for it to produce more than one piece of code.


> On Feb 27, 2017, at 10:34 AM, Leif Walsh <le...@gmail.com> wrote:
> 
> Julian, are you proposing the arrow project ship two artifacts,
> arrow-common and arrow, where arrow depends on arrow-common?
> On Mon, Feb 27, 2017 at 11:51 Julian Hyde <jh...@apache.org> wrote:
> 
>> “Commons” projects are often problematic. It is difficult to tell what is
>> in scope and out of scope. If the scope is drawn too wide, there is a real
>> problem of orphaned features, because people contribute one feature and
>> then disappear.
>> 
>> Let’s remember the Apache mantra: community over code. If you create a
>> sustainable community, the code will get looked after. Would this project
>> form a new community, or just a new piece of code? As I read the current
>> proposal, it would be the intersection of some existing communities, not a
>> new community.
>> 
>> I think it would take a considerable effort to create a new project and
>> community around the idea of “c++ commons” (or is it “database-related c++
>> commons”?). I think you already have such a community, to a first
>> approximation, in the Arrow project, because Kudu and Impala developers are
>> already part of the Arrow community. There’s no reason why Arrow cannot
>> contain new modules that have different release schedules than the rest of
>> Arrow. As a TLP, releases are less burdensome, and can happen in a little
>> over 3 days if the component is kept stable.
>> 
>> Lastly, the code is fungible. It can be marked “experimental” within Arrow
>> and moved to another project, or into a new project, as it matures. The
>> Apache license and the ASF CLA makes this very easy. We are doing something
>> like this in Calcite: the Avatica sub-project [1] has a community that
>> intersect’s with Calcite’s, is disconnected at a code level, and may over
>> time evolve into a separate project. In the mean time, being part of an
>> established project is helpful, because there are PMC members to vote.
>> 
>> Julian
>> 
>> [1] https://calcite.apache.org/avatica/
>> 
>>> On Feb 27, 2017, at 6:41 AM, Wes McKinney <we...@gmail.com> wrote:
>>> 
>>> Responding to Todd's e-mail:
>>> 
>>> 1) Open source release model
>>> 
>>> My expectation is that this library would release about once a month,
>>> with occasional faster releases for critical fixes.
>>> 
>>> 2) Governance/review model
>>> 
>>> Beyond having centralized code reviews, it's hard to predict how the
>>> governance would play out. I understand that OSS projects behave
>>> differently in their planning / design / review process, so work on a
>>> common need may require more of a negotiation than the prior
>>> "unilateral" process.
>>> 
>>> I think it says something for our communities that we would make a
>>> commitment in our collaboration on this to the success of the
>>> "consumer" projects. So if the Arrow or Parquet communities were
>>> contemplating a change that might impact Kudu, for example, it would
>>> be in our best interest to be careful and communicate proactively.
>>> 
>>> This all makes sense. From an Arrow and Parquet perspective, we do not
>>> add very much testing burden because our continuous integration suites
>>> do not take long to run.
>>> 
>>> 3) Pre-commit/test mechanics
>>> 
>>> One thing that would help would be community-maintained
>>> Dockerfiles/Docker images (or equivalent) to assist with validation
>>> and testing for developers.
>>> 
>>> I am happy to comply with a pre-commit testing protocol that works for
>>> the Kudu and Impala teams.
>>> 
>>> 4) Integration mechanics for breaking changes
>>> 
>>>> One option is that each "user" of the libraries manually "rolls" to new
>> versions when they feel like it, but there's still now a case where a
>> common change "pushes work onto" the consumers to update call sites, etc.
>>> 
>>> Breaking API changes will create extra work, because any automated
>>> testing that we create will not be able to validate the patch to the
>>> common library. Perhaps we can configure a manual way (in Jenkins,
>>> say) to test two patches together.
>>> 
>>> In the event that a community member has a patch containing an API
>>> break that impacts a project that they are not a contributor for,
>>> there should be some expectation to either work with the affected
>>> project on a coordinated patch or obtain their +1 to merge the patch
>>> even though it may require a follow-up patch if the roll-forward
>>> in the consumer project exposes bugs in the common library. There may
>>> be situations like:
>>> 
>>> * Kudu changes API in $COMMON that impacts Arrow
>>> * Arrow says +1, we will roll forward $COMMON later
>>> * Patch merged
>>> * Arrow rolls forward, discovers bug caused by patch in $COMMON
>>> * Arrow proposes patch to $COMMON
>>> * ...
>>> 
>>> This is the worst case scenario, of course, but I actually think it is
>>> good because it would indicate that the unit testing in $COMMON needs
>>> to be improved. Unit testing in the common library, therefore, would
>>> take on more of a "defensive" quality than currently.
>>> 
>>> In any case, I'm keen to move forward to coming up with a concrete
>>> plan if we can reach consensus on the particulars.
>>> 
>>> Thanks
>>> Wes
>>> 
>>> On Sun, Feb 26, 2017 at 10:18 PM, Leif Walsh <le...@gmail.com>
>> wrote:
>>>> I also support the idea of creating an "apache commons modern c++" style
>>>> library, maybe tailored toward the needs of columnar data processing
>>>> tools.  I think APR is the wrong project but I think that *style* of
>>>> project is the right direction to aim.
>>>> 
>>>> I agree this adds test and release process complexity across products
>> but I
>>>> think the benefits of a shared, well-tested library outweigh that, and
>>>> creating such test infrastructure will have long-term benefits as well.
>>>> 
>>>> I'd be happy to lend a hand wherever it's needed.
>>>> 
>>>> On Sun, Feb 26, 2017 at 4:03 PM Todd Lipcon <to...@cloudera.com> wrote:
>>>> 
>>>>> Hey folks,
>>>>> 
>>>>> As Henry mentioned, Impala is starting to share more code with Kudu
>> (most
>>>>> notably our RPC system, but that pulls in a fair bit of utility code as
>>>>> well), so we've been chatting periodically offline about the best way
>> to do
>>>>> this. Having more projects potentially interested in collaborating is
>>>>> definitely welcome, though I think does also increase the complexity of
>>>>> whatever solution we come up with.
>>>>> 
>>>>> I think the potential benefits of collaboration are fairly
>> self-evident, so
>>>>> I'll focus on my concerns here, which somewhat echo Henry's.
>>>>> 
>>>>> 1) Open source release model
>>>>> 
>>>>> The ASF is very much against having projects which do not do releases.
>> So,
>>>>> if we were to create some new ASF project to hold this code, we'd be
>>>>> expected to do frequent releases thereof. Wes volunteered above to lead
>>>>> frequent releases, but we actually need at least 3 PMC members to vote
>> on
>>>>> each release, and given people can come and go, we'd probably need at
>> least
>>>>> 5-8 people who are actively committed to helping with the release
>> process
>>>>> of this "commons" project.
>>>>> 
>>>>> Unlike our existing projects, which seem to release every 2-3 months,
>> if
>>>>> that, I think this one would have to release _much_ more frequently,
>> if we
>>>>> expect downstream projects to depend on released versions rather than
>> just
>>>>> pulling in some recent (or even trunk) git hash. Since the ASF
>> requires the
>>>>> normal voting period and process for every release, I don't think we
>> could
>>>>> do something like have "daily automatic releases", etc.
>>>>> 
>>>>> We could probably campaign the ASF membership to treat this project
>>>>> differently, either as (a) a repository of code that never releases, in
>>>>> which case the "downstream" projects are responsible for vetting IP,
>> etc,
>>>>> as part of their own release processes, or (b) a project which does
>>>>> automatic releases voted upon by robots. I'm guessing that (a) is more
>>>>> palatable from an IP perspective, and also from the perspective of the
>>>>> downstream projects.
>>>>> 
>>>>> 
>>>>> 2) Governance/review model
>>>>> 
>>>>> The more projects there are sharing this common code, the more
>> difficult it
>>>>> is to know whether a change would break something, or even whether a
>> change
>>>>> is considered desirable for all of the projects. I don't want to get
>> into
>>>>> some world where any change to a central library requires a multi-week
>>>>> proposal/design-doc/review across 3+ different groups of committers,
>> all of
>>>>> whom may have different near-term priorities. On the other hand, it
>> would
>>>>> be pretty frustrating if the week before we're trying to cut a Kudu
>> release
>>>>> branch, someone in another community decides to make a potentially
>>>>> destabilizing change to the RPC library.
>>>>> 
>>>>> 
>>>>> 3) Pre-commit/test mechanics
>>>>> 
>>>>> Semi-related to the above: we currently feel pretty confident when we
>> make
>>>>> a change to a central library like kudu/util/thread.cc that nothing
>> broke
>>>>> because we run the full suite of Kudu tests. Of course the central
>>>>> libraries have some unit test coverage, but I wouldn't be confident
>> with
>>>>> any sort of model where shared code can change without verification by
>> a
>>>>> larger suite of tests.
>>>>> 
>>>>> On the other hand, I also don't want to move to a model where any
>> change to
>>>>> shared code requires a 6+-hour precommit spanning several projects,
>> each of
>>>>> which may have its own set of potentially-flaky pre-commit tests, etc.
>> I
>>>>> can imagine that if an Arrow developer made some change to "thread.cc"
>> and
>>>>> saw that TabletServerStressTest failed their precommit, they'd have no
>> idea
>>>>> how to triage it, etc. That could be a strong disincentive to continued
>>>>> innovation in these areas of common code, which we'll need a good way
>> to
>>>>> avoid.
>>>>> 
>>>>> I think some of the above could be ameliorated with really good
>>>>> infrastructure -- eg on a test failure, automatically re-run the failed
>>>>> test on both pre-patch and post-patch, do a t-test to check statistical
>>>>> significance in flakiness level, etc. But, that's a lot of
>> infrastructure
>>>>> that doesn't currently exist.
>>>>> 
>>>>> 
>>>>> 4) Integration mechanics for breaking changes
>>>>> 
>>>>> Currently these common libraries are treated as components of
>> monolithic
>>>>> projects. That means it's no extra overhead for us to make some kind of
>>>>> change which breaks an API in src/kudu/util/ and at the same time
>> updates
>>>>> all call sites. The internal libraries have no semblance of API
>>>>> compatibility guarantees, etc, and adding one is not without cost.
>>>>> 
>>>>> Before sharing code, we should figure out how exactly we'll manage the
>>>>> cases where we want to make some change in a common library that
>> breaks an
>>>>> API used by other projects, given there's no way to make an atomic
>> commit
>>>>> across many repositories. One option is that each "user" of the
>> libraries
>>>>> manually "rolls" to new versions when they feel like it, but there's
>> still
>>>>> now a case where a common change "pushes work onto" the consumers to
>> update
>>>>> call sites, etc.
>>>>> 
>>>>> Admittedly, the number of breaking API changes in these common
>> libraries is
>>>>> relatively small, but would still be good to understand how we would
>> plan
>>>>> to manage them.
>>>>> 
>>>>> -Todd
>>>>> 
>>>>> On Sun, Feb 26, 2017 at 10:12 AM, Wes McKinney <we...@gmail.com>
>>>>> wrote:
>>>>> 
>>>>>> [...]
>>>>> 
>>>>> 
>>>>> 
>>>>> --
>>>>> Todd Lipcon
>>>>> Software Engineer, Cloudera
>>>>> 
>>>> --
>>>> --
>>>> Cheers,
>>>> Leif
>> 
>> --
> -- 
> Cheers,
> Leif


Re: [DISCUSS] C++ code sharing amongst Apache {Arrow, Kudu, Impala, Parquet}

Posted by Leif Walsh <le...@gmail.com>.
Julian, are you proposing the arrow project ship two artifacts,
arrow-common and arrow, where arrow depends on arrow-common?
On Mon, Feb 27, 2017 at 11:51 Julian Hyde <jh...@apache.org> wrote:

> “Commons” projects are often problematic. It is difficult to tell what is
> in scope and out of scope. If the scope is drawn too wide, there is a real
> problem of orphaned features, because people contribute one feature and
> then disappear.
>
> Let’s remember the Apache mantra: community over code. If you create a
> sustainable community, the code will get looked after. Would this project
> form a new community, or just a new piece of code? As I read the current
> proposal, it would be the intersection of some existing communities, not a
> new community.
>
> I think it would take a considerable effort to create a new project and
> community around the idea of “c++ commons” (or is it “database-related c++
> commons”?). I think you already have such a community, to a first
> approximation, in the Arrow project, because Kudu and Impala developers are
> already part of the Arrow community. There’s no reason why Arrow cannot
> contain new modules that have different release schedules than the rest of
> Arrow. As a TLP, releases are less burdensome, and can happen in a little
> over 3 days if the component is kept stable.
>
> Lastly, the code is fungible. It can be marked “experimental” within Arrow
> and moved to another project, or into a new project, as it matures. The
> Apache license and the ASF CLA make this very easy. We are doing something
> like this in Calcite: the Avatica sub-project [1] has a community that
> intersects with Calcite’s, is disconnected at a code level, and may over
> time evolve into a separate project. In the mean time, being part of an
> established project is helpful, because there are PMC members to vote.
>
> Julian
>
> [1] https://calcite.apache.org/avatica/ <
> https://calcite.apache.org/avatica/>
>
> > On Feb 27, 2017, at 6:41 AM, Wes McKinney <we...@gmail.com> wrote:
> >
> > Responding to Todd's e-mail:
> >
> > 1) Open source release model
> >
> > My expectation is that this library would release about once a month,
> > with occasional faster releases for critical fixes.
> >
> > 2) Governance/review model
> >
> > Beyond having centralized code reviews, it's hard to predict how the
> > governance would play out. I understand that OSS projects behave
> > differently in their planning / design / review process, so work on a
> > common need may require more of a negotiation than the prior
> > "unilateral" process.
> >
> > I think it says something for our communities that we would make a
> > commitment in our collaboration on this to the success of the
> > "consumer" projects. So if the Arrow or Parquet communities were
> > contemplating a change that might impact Kudu, for example, it would
> > be in our best interest to be careful and communicate proactively.
> >
> > This all makes sense. From an Arrow and Parquet perspective, we do not
> > add very much testing burden because our continuous integration suites
> > do not take long to run.
> >
> > 3) Pre-commit/test mechanics
> >
> > One thing that would help would be community-maintained
> > Dockerfiles/Docker images (or equivalent) to assist with validation
> > and testing for developers.
> >
> > I am happy to comply with a pre-commit testing protocol that works for
> > the Kudu and Impala teams.
> >
> > 4) Integration mechanics for breaking changes
> >
> >> One option is that each "user" of the libraries manually "rolls" to new
> versions when they feel like it, but there's still now a case where a
> common change "pushes work onto" the consumers to update call sites, etc.
> >
> > Breaking API changes will create extra work, because any automated
> > testing that we create will not be able to validate the patch to the
> > common library. Perhaps we can configure a manual way (in Jenkins,
> > say) to test two patches together.
> >
> > In the event that a community member has a patch containing an API
> > break that impacts a project that they are not a contributor for,
> > there should be some expectation to either work with the affected
> > project on a coordinated patch or obtain their +1 to merge the patch
> > even though it may require a follow-up patch if the roll-forward
> > in the consumer project exposes bugs in the common library. There may
> > be situations like:
> >
> > * Kudu changes API in $COMMON that impacts Arrow
> > * Arrow says +1, we will roll forward $COMMON later
> > * Patch merged
> > * Arrow rolls forward, discovers bug caused by patch in $COMMON
> > * Arrow proposes patch to $COMMON
> > * ...
> >
> > This is the worst case scenario, of course, but I actually think it is
> > good because it would indicate that the unit testing in $COMMON needs
> > to be improved. Unit testing in the common library, therefore, would
> > take on more of a "defensive" quality than currently.
> >
> > In any case, I'm keen to move forward to coming up with a concrete
> > plan if we can reach consensus on the particulars.
> >
> > Thanks
> > Wes
> >
> > On Sun, Feb 26, 2017 at 10:18 PM, Leif Walsh <le...@gmail.com>
> wrote:
> >> I also support the idea of creating an "apache commons modern c++" style
> >> library, maybe tailored toward the needs of columnar data processing
> >> tools.  I think APR is the wrong project but I think that *style* of
> >> project is the right direction to aim.
> >>
> >> I agree this adds test and release process complexity across products
> but I
> >> think the benefits of a shared, well-tested library outweigh that, and
> >> creating such test infrastructure will have long-term benefits as well.
> >>
> >> I'd be happy to lend a hand wherever it's needed.
> >>
> >> On Sun, Feb 26, 2017 at 4:03 PM Todd Lipcon <to...@cloudera.com> wrote:
> >>
> >>> Hey folks,
> >>>
> >>> As Henry mentioned, Impala is starting to share more code with Kudu
> (most
> >>> notably our RPC system, but that pulls in a fair bit of utility code as
> >>> well), so we've been chatting periodically offline about the best way
> to do
> >>> this. Having more projects potentially interested in collaborating is
> >>> definitely welcome, though I think it does also increase the complexity of
> >>> whatever solution we come up with.
> >>>
> >>> I think the potential benefits of collaboration are fairly
> self-evident, so
> >>> I'll focus on my concerns here, which somewhat echo Henry's.
> >>>
> >>> 1) Open source release model
> >>>
> >>> The ASF is very much against having projects which do not do releases.
> So,
> >>> if we were to create some new ASF project to hold this code, we'd be
> >>> expected to do frequent releases thereof. Wes volunteered above to lead
> >>> frequent releases, but we actually need at least 3 PMC members to vote
> on
> >>> each release, and given people can come and go, we'd probably need at
> least
> >>> 5-8 people who are actively committed to helping with the release
> process
> >>> of this "commons" project.
> >>>
> >>> Unlike our existing projects, which seem to release every 2-3 months,
> if
> >>> that, I think this one would have to release _much_ more frequently,
> if we
> >>> expect downstream projects to depend on released versions rather than
> just
> >>> pulling in some recent (or even trunk) git hash. Since the ASF
> requires the
> >>> normal voting period and process for every release, I don't think we
> could
> >>> do something like have "daily automatic releases", etc.
> >>>
> >>> We could probably campaign the ASF membership to treat this project
> >>> differently, either as (a) a repository of code that never releases, in
> >>> which case the "downstream" projects are responsible for vetting IP,
> etc,
> >>> as part of their own release processes, or (b) a project which does
> >>> automatic releases voted upon by robots. I'm guessing that (a) is more
> >>> palatable from an IP perspective, and also from the perspective of the
> >>> downstream projects.
> >>>
> >>>
> >>> 2) Governance/review model
> >>>
> >>> The more projects there are sharing this common code, the more
> difficult it
> >>> is to know whether a change would break something, or even whether a
> change
> >>> is considered desirable for all of the projects. I don't want to get
> into
> >>> some world where any change to a central library requires a multi-week
> >>> proposal/design-doc/review across 3+ different groups of committers,
> all of
> >>> whom may have different near-term priorities. On the other hand, it
> would
> >>> be pretty frustrating if the week before we're trying to cut a Kudu
> release
> >>> branch, someone in another community decides to make a potentially
> >>> destabilizing change to the RPC library.
> >>>
> >>>
> >>> 3) Pre-commit/test mechanics
> >>>
> >>> Semi-related to the above: we currently feel pretty confident when we
> make
> >>> a change to a central library like kudu/util/thread.cc that nothing
> broke
> >>> because we run the full suite of Kudu tests. Of course the central
> >>> libraries have some unit test coverage, but I wouldn't be confident
> with
> >>> any sort of model where shared code can change without verification by
> a
> >>> larger suite of tests.
> >>>
> >>> On the other hand, I also don't want to move to a model where any
> change to
> >>> shared code requires a 6+-hour precommit spanning several projects,
> each of
> >>> which may have its own set of potentially-flaky pre-commit tests, etc.
> I
> >>> can imagine that if an Arrow developer made some change to "thread.cc"
> and
> >>> saw that TabletServerStressTest failed their precommit, they'd have no
> idea
> >>> how to triage it, etc. That could be a strong disincentive to continued
> >>> innovation in these areas of common code, which we'll need a good way
> to
> >>> avoid.
> >>>
> >>> I think some of the above could be ameliorated with really good
> >>> infrastructure -- eg on a test failure, automatically re-run the failed
> >>> test on both pre-patch and post-patch, do a t-test to check statistical
> >>> significance in flakiness level, etc. But, that's a lot of
> infrastructure
> >>> that doesn't currently exist.
> >>>
> >>>
> >>> 4) Integration mechanics for breaking changes
> >>>
> >>> Currently these common libraries are treated as components of
> monolithic
> >>> projects. That means it's no extra overhead for us to make some kind of
> >>> change which breaks an API in src/kudu/util/ and at the same time
> updates
> >>> all call sites. The internal libraries have no semblance of API
> >>> compatibility guarantees, etc, and adding one is not without cost.
> >>>
> >>> Before sharing code, we should figure out how exactly we'll manage the
> >>> cases where we want to make some change in a common library that
> breaks an
> >>> API used by other projects, given there's no way to make an atomic
> commit
> >>> across many repositories. One option is that each "user" of the
> libraries
> >>> manually "rolls" to new versions when they feel like it, but there's
> still
> >>> now a case where a common change "pushes work onto" the consumers to
> update
> >>> call sites, etc.
> >>>
> >>> Admittedly, the number of breaking API changes in these common
> libraries is
> >>> relatively small, but would still be good to understand how we would
> plan
> >>> to manage them.
> >>>
> >>> -Todd
> >>>
> >>> On Sun, Feb 26, 2017 at 10:12 AM, Wes McKinney <we...@gmail.com>
> >>> wrote:
> >>>
> >>>> hi Henry,
> >>>>
> >>>> Thank you for these comments.
> >>>>
> >>>> I think having a kind of "Apache Commons for [Modern] C++" would be an
> >>>> ideal (though perhaps initially more labor intensive) solution.
> >>>> There's code in Arrow that I would move into this project if it
> >>>> existed. I am happy to help make this happen if there is interest from
> >>>> the Kudu and Impala communities. I am not sure logistically what would
> >>>> be the most expedient way to establish the project, whether as an ASF
> >>>> Incubator project or possibly as a new TLP that could be created by
> >>>> spinning IP out of Apache Kudu.
> >>>>
> >>>> I'm interested to hear the opinions of others, and possible next
> steps.
> >>>>
> >>>> Thanks
> >>>> Wes
> >>>>
> >>>> On Sun, Feb 26, 2017 at 2:12 AM, Henry Robinson <he...@apache.org>
> >>> wrote:
> >>>>> Thanks for bringing this up, Wes.
> >>>>>
> >>>>> On 25 February 2017 at 14:18, Wes McKinney <we...@gmail.com>
> >>> wrote:
> >>>>>
> >>>>>> Dear Apache Kudu and Apache Impala (incubating) communities,
> >>>>>>
> >>>>>> (I'm not sure the best way to have a cross-list discussion, so I
> >>>>>> apologize if this does not work well)
> >>>>>>
> >>>>>> On the recent Apache Parquet sync call, we discussed C++ code
> sharing
> >>>>>> between the codebases in Apache Arrow and Apache Parquet, and
> >>>>>> opportunities for more code sharing with Kudu and Impala as well.
> >>>>>>
> >>>>>> As context
> >>>>>>
> >>>>>> * We have an RC out for the 1.0.0 release of apache-parquet-cpp, the
> >>>>>> first C++ release within Apache Parquet. I got involved with this
> >>>>>> project a little over a year ago and was faced with the unpleasant
> >>>>>> decision to copy and paste a significant amount of code out of
> >>>>>> Impala's codebase to bootstrap the project.
> >>>>>>
> >>>>>> * In parallel, we began the Apache Arrow project, which is designed
> to
> >>>>>> be a complementary library for file formats (like Parquet), storage
> >>>>>> engines (like Kudu), and compute engines (like Impala and pandas).
> >>>>>>
> >>>>>> * As Arrow and parquet-cpp matured, an increasing amount of code
> >>>>>> overlap crept up surrounding buffer memory management and IO
> >>>>>> interface. We recently decided in PARQUET-818
> >>>>>> (https://github.com/apache/parquet-cpp/commit/
> >>>>>> 2154e873d5aa7280314189a2683fb1e12a590c02)
> >>>>>> to remove some of the obvious code overlap in Parquet and make
> >>>>>> libarrow.a/so a hard compile and link-time dependency for
> >>>>>> libparquet.a/so.
> >>>>>>
> >>>>>> * There is still quite a bit of code in parquet-cpp that would
> better
> >>>>>> fit in Arrow: SIMD hash utilities, RLE encoding, dictionary
> encoding,
> >>>>>> compression, bit utilities, and so forth. Much of this code
> originated
> >>>>>> from Impala
> >>>>>>
> >>>>>> This brings me to a next set of points:
> >>>>>>
> >>>>>> * parquet-cpp contains quite a bit of code that was extracted from
> >>>>>> Impala. This is mostly self-contained in
> >>>>>> https://github.com/apache/parquet-cpp/tree/master/src/parquet/util
> >>>>>>
> >>>>>> * My understanding is that Kudu extracted certain computational
> >>>>>> utilities from Impala in its early days, but these tools have likely
> >>>>>> diverged as the needs of the projects have evolved.
> >>>>>>
> >>>>>> Since all of these projects are quite different in their end goals
> >>>>>> (runtime systems vs. libraries), touching code that is tightly
> coupled
> >>>>>> to either Kudu or Impala's runtimes is probably not worth
> discussing.
> >>>>>> However, I think there is a strong basis for collaboration on
> >>>>>> computational utilities and vectorized array processing. Some
> obvious
> >>>>>> areas that come to mind:
> >>>>>>
> >>>>>> * SIMD utilities (for hashing or processing of preallocated
> contiguous
> >>>>>> memory)
> >>>>>> * Array encoding utilities: RLE / Dictionary, etc.
> >>>>>> * Bit manipulation (packing and unpacking, e.g. Daniel Lemire
> >>>>>> contributed a patch to parquet-cpp around this)
> >>>>>> * Date and time utilities
> >>>>>> * Compression utilities
> >>>>>>
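[Editor's note: to make the list above concrete, here is a toy sketch of one of the candidate utilities, run-length encoding. This is illustrative only; the actual RLE code shared by Impala and parquet-cpp is a bit-width-aware RLE/bit-packing hybrid that operates on packed byte buffers, not on `std::vector` as below.]

```cpp
#include <cassert>
#include <cstdint>
#include <utility>
#include <vector>

// Encode a sequence of values as (value, run length) pairs.
template <typename T>
std::vector<std::pair<T, uint32_t>> RleEncode(const std::vector<T>& values) {
  std::vector<std::pair<T, uint32_t>> runs;
  for (const T& v : values) {
    if (!runs.empty() && runs.back().first == v) {
      ++runs.back().second;     // extend the current run
    } else {
      runs.emplace_back(v, 1);  // start a new run
    }
  }
  return runs;
}

// Expand (value, run length) pairs back into the original sequence.
template <typename T>
std::vector<T> RleDecode(const std::vector<std::pair<T, uint32_t>>& runs) {
  std::vector<T> out;
  for (const auto& run : runs) {
    out.insert(out.end(), run.second, run.first);
  }
  return out;
}
```

Even a utility this small carries the coordination questions raised in the thread: its API, test coverage, and performance characteristics would need to satisfy every consuming project at once.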
> >>>>>
> >>>>> Between Kudu and Impala (at least) there are many more opportunities
> >>> for
> >>>>> sharing. Threads, logging, metrics, concurrent primitives - the list
> is
> >>>>> quite long.
> >>>>>
> >>>>>
> >>>>>>
> >>>>>> I hope the benefits are obvious: consolidating efforts on unit
> >>>>>> testing, benchmarking, performance optimizations, continuous
> >>>>>> integration, and platform compatibility.
> >>>>>>
> >>>>>> Logistically speaking, one possible avenue might be to use Apache
> >>>>>> Arrow as the place to assemble this code. Its thirdparty toolchain
> is
> >>>>>> small, and it builds and installs fast. It is intended as a library
> to
> >>>>>> have its headers used and linked against other applications. (As an
> >>>>>> aside, I'm very interested in building optional support for Arrow
> >>>>>> columnar messages into the kudu client).
> >>>>>>
> >>>>>
> >>>>> In principle I'm in favour of code sharing, and it seems very much in
> >>>>> keeping with the Apache way. However, practically speaking I'm of the
> >>>>> opinion that it only makes sense to house shared support code in a
> >>>>> separate, dedicated project.
> >>>>>
> >>>>> Embedding the shared libraries in, e.g., Arrow naturally limits the
> >>> scope
> >>>>> of sharing to utilities that Arrow is interested in. It would make no
> >>>> sense
> >>>>> to add a threading library to Arrow if it was never used natively.
> >>>> Muddying
> >>>>> the waters of the project's charter seems likely to lead to user, and
> >>>>> developer, confusion. Similarly, we should not necessarily couple
> >>> Arrow's
> >>>>> design goals to those it inherits from Kudu and Impala's source code.
> >>>>>
> >>>>> I think I'd rather see a new Apache project than re-use a current one
> >>> for
> >>>>> two independent purposes.
> >>>>>
> >>>>>
> >>>>>>
> >>>>>> The downsides of code sharing, which may have prevented it so far,
> are
> >>>>>> the logistics of coordinating ASF release cycles and keeping build
> >>>>>> toolchains in sync. It's taken us the past year to stabilize the
> >>>>>> design of Arrow for its intended use cases, so at this point if we
> >>>>>> went down this road I would be OK with helping the community commit
> to
> >>>>>> a regular release cadence that would be faster than Impala, Kudu,
> and
> >>>>>> Parquet's respective release cadences. Since members of the Kudu and
> >>>>>> Impala PMC are also on the Arrow PMC, I trust we would be able to
> >>>>>> collaborate to each other's mutual benefit and success.
> >>>>>>
> >>>>>> Note that Arrow does not throw C++ exceptions and similarly follows
> >>>>>> the Google C++ style guide to the same extent as Kudu and Impala.
> >>>>>>
> >>>>>> If this is something that either the Kudu or Impala communities
> would
> >>>>>> like to pursue in earnest, I would be happy to work with you on next
> >>>>>> steps. I would suggest that we start with something small so that we
> >>>>>> could address the necessary build toolchain changes, and develop a
> >>>>>> workflow for moving around code and tests, a protocol for code
> reviews
> >>>>>> (e.g. Gerrit), and coordinating ASF releases.
> >>>>>>
> >>>>>
> >>>>> I think, if I'm reading this correctly, that you're assuming
> >>> integration
> >>>>> with the 'downstream' projects (e.g. Impala and Kudu) would be done
> via
> >>>>> their toolchains. For something as fast moving as utility code - and
> >>>>> critical, where you want the latency between adding a fix and
> including
> >>>> it
> >>>>> in your build to be ~0 - that's a non-starter to me, at least with
> how
> >>>> the
> >>>>> toolchains are currently realised.
> >>>>>
> >>>>> I'd rather have the source code directly imported into Impala's tree
> -
> >>>>> whether by git submodule or other mechanism. That way the coupling is
> >>>>> looser, and we can move more quickly. I think that's important to
> other
> >>>>> projects as well.
> >>>>>
> >>>>> Henry
> >>>>>
> >>>>>
> >>>>>
> >>>>>>
> >>>>>> Let me know what you think.
> >>>>>>
> >>>>>> best
> >>>>>> Wes
> >>>>>>
> >>>>
> >>>
> >>>
> >>>
> >>> --
> >>> Todd Lipcon
> >>> Software Engineer, Cloudera
> >>>
> >> --
> >> --
> >> Cheers,
> >> Leif
>
> --
-- 
Cheers,
Leif

Re: [DISCUSS] C++ code sharing amongst Apache {Arrow, Kudu, Impala, Parquet}

Posted by Leif Walsh <le...@gmail.com>.
Julian, are you proposing the arrow project ship two artifacts,
arrow-common and arrow, where arrow depends on arrow-common?
On Mon, Feb 27, 2017 at 11:51 Julian Hyde <jh...@apache.org> wrote:

> “Commons” projects are often problematic. It is difficult to tell what is
> in scope and out of scope. If the scope is drawn too wide, there is a real
> problem of orphaned features, because people contribute one feature and
> then disappear.
>
> Let’s remember the Apache mantra: community over code. If you create a
> sustainable community, the code will get looked after. Would this project
> form a new community, or just a new piece of code? As I read the current
> proposal, it would be the intersection of some existing communities, not a
> new community.
>
> I think it would take a considerable effort to create a new project and
> community around the idea of “c++ commons” (or is it “database-related c++
> commons”?). I think you already have such a community, to a first
> approximation, in the Arrow project, because Kudu and Impala developers are
> already part of the Arrow community. There’s no reason why Arrow cannot
> contain new modules that have different release schedules than the rest of
> Arrow. As a TLP, releases are less burdensome, and can happen in a little
> over 3 days if the component is kept stable.
>
> Lastly, the code is fungible. It can be marked “experimental” within Arrow
> and moved to another project, or into a new project, as it matures. The
> Apache license and the ASF CLA makes this very easy. We are doing something
> like this in Calcite: the Avatica sub-project [1] has a community that
> intersect’s with Calcite’s, is disconnected at a code level, and may over
> time evolve into a separate project. In the mean time, being part of an
> established project is helpful, because there are PMC members to vote.
>
> Julian
>
> [1] https://calcite.apache.org/avatica/ <
> https://calcite.apache.org/avatica/>
>
> > On Feb 27, 2017, at 6:41 AM, Wes McKinney <we...@gmail.com> wrote:
> >
> > Responding to Todd's e-mail:
> >
> > 1) Open source release model
> >
> > My expectation is that this library would release about once a month,
> > with occasional faster releases for critical fixes.
> >
> > 2) Governance/review model
> >
> > Beyond having centralized code reviews, it's hard to predict how the
> > governance would play out. I understand that OSS projects behave
> > differently in their planning / design / review process, so work on a
> > common need may require more of a negotiation than the prior
> > "unilateral" process.
> >
> > I think it says something for our communities that we would make a
> > commitment in our collaboration on this to the success of the
> > "consumer" projects. So if the Arrow or Parquet communities were
> > contemplating a change that might impact Kudu, for example, it would
> > be in our best interest to be careful and communicate proactively.
> >
> > This all makes sense. From an Arrow and Parquet perspective, we do not
> > add very much testing burden because our continuous integration suites
> > do not take long to run.
> >
> > 3) Pre-commit/test mechanics
> >
> > One thing that would help would be community-maintained
> > Dockerfiles/Docker images (or equivalent) to assist with validation
> > and testing for developers.
> >
> > I am happy to comply with a pre-commit testing protocol that works for
> > the Kudu and Impala teams.
> >
> > 4) Integration mechanics for breaking changes
> >
> >> One option is that each "user" of the libraries manually "rolls" to new
> versions when they feel like it, but there's still now a case where a
> common change "pushes work onto" the consumers to update call sites, etc.
> >
> > Breaking API changes will create extra work, because any automated
> > testing that we create will not be able to validate the patch to the
> > common library. Perhaps we can configure a manual way (in Jenkins,
> > say) to test two patches together.
> >
> > In the event that a community member has a patch containing an API
> > break that impacts a project that they are not a contributor for,
> > there should be some expectation to either work with the affected
> > project on a coordinated patch or obtain their +1 to merge the patch
> > even though it will may require a follow up patch if the roll-forward
> > in the consumer project exposes bugs in the common library. There may
> > be situations like:
> >
> > * Kudu changes API in $COMMON that impacts Arrow
> > * Arrow says +1, we will roll forward $COMMON later
> > * Patch merged
> > * Arrow rolls forward, discovers bug caused by patch in $COMMON
> > * Arrow proposes patch to $COMMON
> > * ...
> >
> > This is the worst case scenario, of course, but I actually think it is
> > good because it would indicate that the unit testing in $COMMON needs
> > to be improved. Unit testing in the common library, therefore, would
> > take on more of a "defensive" quality than currently.
> >
> > In any case, I'm keen to move forward to coming up with a concrete
> > plan if we can reach consensus on the particulars.
> >
> > Thanks
> > Wes
> >
> > On Sun, Feb 26, 2017 at 10:18 PM, Leif Walsh <le...@gmail.com>
> wrote:
> >> I also support the idea of creating an "apache commons modern c++" style
> >> library, maybe tailored toward the needs of columnar data processing
> >> tools.  I think APR is the wrong project but I think that *style* of
> >> project is the right direction to aim.
> >>
> >> I agree this adds test and release process complexity across products
> but I
> >> think the benefits of a shared, well-tested library outweigh that, and
> >> creating such test infrastructure will have long-term benefits as well.
> >>
> >> I'd be happy to lend a hand wherever it's needed.
> >>
> >> On Sun, Feb 26, 2017 at 4:03 PM Todd Lipcon <to...@cloudera.com> wrote:
> >>
> >>> Hey folks,
> >>>
> >>> As Henry mentioned, Impala is starting to share more code with Kudu
> (most
> >>> notably our RPC system, but that pulls in a fair bit of utility code as
> >>> well), so we've been chatting periodically offline about the best way
> to do
> >>> this. Having more projects potentially interested in collaborating is
> >>> definitely welcome, though I think does also increase the complexity of
> >>> whatever solution we come up with.
> >>>
> >>> I think the potential benefits of collaboration are fairly
> self-evident, so
> >>> I'll focus on my concerns here, which somewhat echo Henry's.
> >>>
> >>> 1) Open source release model
> >>>
> >>> The ASF is very much against having projects which do not do releases.
> So,
> >>> if we were to create some new ASF project to hold this code, we'd be
> >>> expected to do frequent releases thereof. Wes volunteered above to lead
> >>> frequent releases, but we actually need at least 3 PMC members to vote
> on
> >>> each release, and given people can come and go, we'd probably need at
> least
> >>> 5-8 people who are actively committed to helping with the release
> process
> >>> of this "commons" project.
> >>>
> >>> Unlike our existing projects, which seem to release every 2-3 months,
> if
> >>> that, I think this one would have to release _much_ more frequently,
> if we
> >>> expect downstream projects to depend on released versions rather than
> just
> >>> pulling in some recent (or even trunk) git hash. Since the ASF
> requires the
> >>> normal voting period and process for every release, I don't think we
> could
> >>> do something like have "daily automatic releases", etc.
> >>>
> >>> We could probably campaign the ASF membership to treat this project
> >>> differently, either as (a) a repository of code that never releases, in
> >>> which case the "downstream" projects are responsible for vetting IP,
> etc,
> >>> as part of their own release processes, or (b) a project which does
> >>> automatic releases voted upon by robots. I'm guessing that (a) is more
> >>> palatable from an IP perspective, and also from the perspective of the
> >>> downstream projects.
> >>>
> >>>
> >>> 2) Governance/review model
> >>>
> >>> The more projects there are sharing this common code, the more
> difficult it
> >>> is to know whether a change would break something, or even whether a
> change
> >>> is considered desirable for all of the projects. I don't want to get
> into
> >>> some world where any change to a central library requires a multi-week
> >>> proposal/design-doc/review across 3+ different groups of committers,
> all of
> >>> whom may have different near-term priorities. On the other hand, it
> would
> >>> be pretty frustrating if the week before we're trying to cut a Kudu
> release
> >>> branch, someone in another community decides to make a potentially
> >>> destabilizing change to the RPC library.
> >>>
> >>>
> >>> 3) Pre-commit/test mechanics
> >>>
> >>> Semi-related to the above: we currently feel pretty confident when we
> make
> >>> a change to a central library like kudu/util/thread.cc that nothing
> broke
> >>> because we run the full suite of Kudu tests. Of course the central
> >>> libraries have some unit test coverage, but I wouldn't be confident
> with
> >>> any sort of model where shared code can change without verification by
> a
> >>> larger suite of tests.
> >>>
> >>> On the other hand, I also don't want to move to a model where any
> change to
> >>> shared code requires a 6+-hour precommit spanning several projects,
> each of
> >>> which may have its own set of potentially-flaky pre-commit tests, etc.
> I
> >>> can imagine that if an Arrow developer made some change to "thread.cc"
> and
> >>> saw that TabletServerStressTest failed their precommit, they'd have no
> idea
> >>> how to triage it, etc. That could be a strong disincentive to continued
> >>> innovation in these areas of common code, which we'll need a good way
> to
> >>> avoid.
> >>>
> >>> I think some of the above could be ameliorated with really good
> >>> infrastructure -- eg on a test failure, automatically re-run the failed
> >>> test on both pre-patch and post-patch, do a t-test to check statistical
> >>> significance in flakiness level, etc. But, that's a lot of
> infrastructure
> >>> that doesn't currently exist.
> >>>
> >>>
> >>> 4) Integration mechanics for breaking changes
> >>>
> >>> Currently these common libraries are treated as components of
> monolithic
> >>> projects. That means it's no extra overhead for us to make some kind of
> >>> change which breaks an API in src/kudu/util/ and at the same time
> updates
> >>> all call sites. The internal libraries have no semblance of API
> >>> compatibility guarantees, etc, and adding one is not without cost.
> >>>
> >>> Before sharing code, we should figure out how exactly we'll manage the
> >>> cases where we want to make some change in a common library that
> breaks an
> >>> API used by other projects, given there's no way to make an atomic
> commit
> >>> across many repositories. One option is that each "user" of the
> libraries
> >>> manually "rolls" to new versions when they feel like it, but there's
> still
> >>> now a case where a common change "pushes work onto" the consumers to
> update
> >>> call sites, etc.
> >>>
> >>> Admittedly, the number of breaking API changes in these common
> libraries is
> >>> relatively small, but would still be good to understand how we would
> plan
> >>> to manage them.
> >>>
> >>> -Todd
> >>>
> >>> On Sun, Feb 26, 2017 at 10:12 AM, Wes McKinney <we...@gmail.com>
> >>> wrote:
> >>>
> >>>> hi Henry,
> >>>>
> >>>> Thank you for these comments.
> >>>>
> >>>> I think having a kind of "Apache Commons for [Modern] C++" would be an
> >>>> ideal (though perhaps initially more labor intensive) solution.
> >>>> There's code in Arrow that I would move into this project if it
> >>>> existed. I am happy to help make this happen if there is interest from
> >>>> the Kudu and Impala communities. I am not sure logistically what would
> >>>> be the most expedient way to establish the project, whether as an ASF
> >>>> Incubator project or possibly as a new TLP that could be created by
> >>>> spinning IP out of Apache Kudu.
> >>>>
> >>>> I'm interested to hear the opinions of others, and possible next
> steps.
> >>>>
> >>>> Thanks
> >>>> Wes
> >>>>
> >>>> On Sun, Feb 26, 2017 at 2:12 AM, Henry Robinson <he...@apache.org>
> >>> wrote:
> >>>>> Thanks for bringing this up, Wes.
> >>>>>
> >>>>> On 25 February 2017 at 14:18, Wes McKinney <we...@gmail.com>
> >>> wrote:
> >>>>>
> >>>>>> Dear Apache Kudu and Apache Impala (incubating) communities,
> >>>>>>
> >>>>>> (I'm not sure the best way to have a cross-list discussion, so I
> >>>>>> apologize if this does not work well)
> >>>>>>
> >>>>>> On the recent Apache Parquet sync call, we discussed C++ code
> sharing
> >>>>>> between the codebases in Apache Arrow and Apache Parquet, and
> >>>>>> opportunities for more code sharing with Kudu and Impala as well.
> >>>>>>
> >>>>>> As context
> >>>>>>
> >>>>>> * We have an RC out for the 1.0.0 release of apache-parquet-cpp, the
> >>>>>> first C++ release within Apache Parquet. I got involved with this
> >>>>>> project a little over a year ago and was faced with the unpleasant
> >>>>>> decision to copy and paste a significant amount of code out of
> >>>>>> Impala's codebase to bootstrap the project.
> >>>>>>
> >>>>>> * In parallel, we began the Apache Arrow project, which is designed
> to
> >>>>>> be a complementary library for file formats (like Parquet), storage
> >>>>>> engines (like Kudu), and compute engines (like Impala and pandas).
> >>>>>>
> >>>>>> * As Arrow and parquet-cpp matured, an increasing amount of code
> >>>>>> overlap crept up surrounding buffer memory management and IO
> >>>>>> interface. We recently decided in PARQUET-818
> >>>>>> (https://github.com/apache/parquet-cpp/commit/2154e873d5aa7280314189a2683fb1e12a590c02)
> >>>>>> to remove some of the obvious code overlap in Parquet and make
> >>>>>> libarrow.a/so a hard compile and link-time dependency for
> >>>>>> libparquet.a/so.
> >>>>>>
> >>>>>> * There is still quite a bit of code in parquet-cpp that would
> better
> >>>>>> fit in Arrow: SIMD hash utilities, RLE encoding, dictionary
> encoding,
> >>>>>> compression, bit utilities, and so forth. Much of this code
> originated
> >>>>>> from Impala
> >>>>>>
> >>>>>> This brings me to a next set of points:
> >>>>>>
> >>>>>> * parquet-cpp contains quite a bit of code that was extracted from
> >>>>>> Impala. This is mostly self-contained in
> >>>>>> https://github.com/apache/parquet-cpp/tree/master/src/parquet/util
> >>>>>>
> >>>>>> * My understanding is that Kudu extracted certain computational
> >>>>>> utilities from Impala in its early days, but these tools have likely
> >>>>>> diverged as the needs of the projects have evolved.
> >>>>>>
> >>>>>> Since all of these projects are quite different in their end goals
> >>>>>> (runtime systems vs. libraries), touching code that is tightly
> coupled
> >>>>>> to either Kudu or Impala's runtimes is probably not worth
> discussing.
> >>>>>> However, I think there is a strong basis for collaboration on
> >>>>>> computational utilities and vectorized array processing. Some
> obvious
> >>>>>> areas that come to mind:
> >>>>>>
> >>>>>> * SIMD utilities (for hashing or processing of preallocated
> contiguous
> >>>>>> memory)
> >>>>>> * Array encoding utilities: RLE / Dictionary, etc.
> >>>>>> * Bit manipulation (packing and unpacking, e.g. Daniel Lemire
> >>>>>> contributed a patch to parquet-cpp around this)
> >>>>>> * Date and time utilities
> >>>>>> * Compression utilities
> >>>>>>
> >>>>>
> >>>>> Between Kudu and Impala (at least) there are many more opportunities
> >>> for
> >>>>> sharing. Threads, logging, metrics, concurrent primitives - the list
> is
> >>>>> quite long.
> >>>>>
> >>>>>
> >>>>>>
> >>>>>> I hope the benefits are obvious: consolidating efforts on unit
> >>>>>> testing, benchmarking, performance optimizations, continuous
> >>>>>> integration, and platform compatibility.
> >>>>>>
> >>>>>> Logistically speaking, one possible avenue might be to use Apache
> >>>>>> Arrow as the place to assemble this code. Its thirdparty toolchain
> is
> >>>>>> small, and it builds and installs fast. It is intended as a library
> to
> >>>>>> have its headers used and linked against other applications. (As an
> >>>>>> aside, I'm very interested in building optional support for Arrow
> >>>>>> columnar messages into the kudu client).
> >>>>>>
> >>>>>
> >>>>> In principle I'm in favour of code sharing, and it seems very much in
> >>>>> keeping with the Apache way. However, practically speaking I'm of the
> >>>>> opinion that it only makes sense to house shared support code in a
> >>>>> separate, dedicated project.
> >>>>>
> >>>>> Embedding the shared libraries in, e.g., Arrow naturally limits the
> >>> scope
> >>>>> of sharing to utilities that Arrow is interested in. It would make no
> >>>> sense
> >>>>> to add a threading library to Arrow if it was never used natively.
> >>>> Muddying
> >>>>> the waters of the project's charter seems likely to lead to user, and
> >>>>> developer, confusion. Similarly, we should not necessarily couple
> >>> Arrow's
> >>>>> design goals to those it inherits from Kudu and Impala's source code.
> >>>>>
> >>>>> I think I'd rather see a new Apache project than re-use a current one
> >>> for
> >>>>> two independent purposes.
> >>>>>
> >>>>>
> >>>>>>
> >>>>>> The downside of code sharing, which may have prevented it so far,
> are
> >>>>>> the logistics of coordinating ASF release cycles and keeping build
> >>>>>> toolchains in sync. It's taken us the past year to stabilize the
> >>>>>> design of Arrow for its intended use cases, so at this point if we
> >>>>>> went down this road I would be OK with helping the community commit
> to
> >>>>>> a regular release cadence that would be faster than Impala, Kudu,
> and
> >>>>>> Parquet's respective release cadences. Since members of the Kudu and
> >>>>>> Impala PMC are also on the Arrow PMC, I trust we would be able to
> >>>>>> collaborate to each other's mutual benefit and success.
> >>>>>>
> >>>>>> Note that Arrow does not throw C++ exceptions and similarly follows
> >>>>>> Google C++ style guide to the same extent as Kudu and Impala.
> >>>>>>
> >>>>>> If this is something that either the Kudu or Impala communities
> would
> >>>>>> like to pursue in earnest, I would be happy to work with you on next
> >>>>>> steps. I would suggest that we start with something small so that we
> >>>>>> could address the necessary build toolchain changes, and develop a
> >>>>>> workflow for moving around code and tests, a protocol for code
> reviews
> >>>>>> (e.g. Gerrit), and coordinating ASF releases.
> >>>>>>
> >>>>>
> >>>>> I think, if I'm reading this correctly, that you're assuming
> >>> integration
> >>>>> with the 'downstream' projects (e.g. Impala and Kudu) would be done
> via
> >>>>> their toolchains. For something as fast moving as utility code - and
> >>>>> critical, where you want the latency between adding a fix and
> including
> >>>> it
> >>>>> in your build to be ~0 - that's a non-starter to me, at least with
> how
> >>>> the
> >>>>> toolchains are currently realised.
> >>>>>
> >>>>> I'd rather have the source code directly imported into Impala's tree
> -
> >>>>> whether by git submodule or other mechanism. That way the coupling is
> >>>>> looser, and we can move more quickly. I think that's important to
> other
> >>>>> projects as well.
> >>>>>
> >>>>> Henry
> >>>>>
> >>>>>
> >>>>>
> >>>>>>
> >>>>>> Let me know what you think.
> >>>>>>
> >>>>>> best
> >>>>>> Wes
> >>>>>>
> >>>>
> >>>
> >>>
> >>>
> >>> --
> >>> Todd Lipcon
> >>> Software Engineer, Cloudera
> >>>
> >> --
> >> --
> >> Cheers,
> >> Leif
>
> --
-- 
Cheers,
Leif


Re: [DISCUSS] C++ code sharing amongst Apache {Arrow, Kudu, Impala, Parquet}

Posted by Leif Walsh <le...@gmail.com>.
Julian, are you proposing the arrow project ship two artifacts,
arrow-common and arrow, where arrow depends on arrow-common?
On Mon, Feb 27, 2017 at 11:51 Julian Hyde <jh...@apache.org> wrote:

> “Commons” projects are often problematic. It is difficult to tell what is
> in scope and out of scope. If the scope is drawn too wide, there is a real
> problem of orphaned features, because people contribute one feature and
> then disappear.
>
> Let’s remember the Apache mantra: community over code. If you create a
> sustainable community, the code will get looked after. Would this project
> form a new community, or just a new piece of code? As I read the current
> proposal, it would be the intersection of some existing communities, not a
> new community.
>
> I think it would take a considerable effort to create a new project and
> community around the idea of “c++ commons” (or is it “database-related c++
> commons”?). I think you already have such a community, to a first
> approximation, in the Arrow project, because Kudu and Impala developers are
> already part of the Arrow community. There’s no reason why Arrow cannot
> contain new modules that have different release schedules than the rest of
> Arrow. As a TLP, releases are less burdensome, and can happen in a little
> over 3 days if the component is kept stable.
>
> Lastly, the code is fungible. It can be marked “experimental” within Arrow
> and moved to another project, or into a new project, as it matures. The
> Apache license and the ASF CLA makes this very easy. We are doing something
> like this in Calcite: the Avatica sub-project [1] has a community that
> intersect’s with Calcite’s, is disconnected at a code level, and may over
> time evolve into a separate project. In the mean time, being part of an
> established project is helpful, because there are PMC members to vote.
>
> Julian
>
> [1] https://calcite.apache.org/avatica/ <
> https://calcite.apache.org/avatica/>
>
> > On Feb 27, 2017, at 6:41 AM, Wes McKinney <we...@gmail.com> wrote:
> >
> > Responding to Todd's e-mail:
> >
> > 1) Open source release model
> >
> > My expectation is that this library would release about once a month,
> > with occasional faster releases for critical fixes.
> >
> > 2) Governance/review model
> >
> > Beyond having centralized code reviews, it's hard to predict how the
> > governance would play out. I understand that OSS projects behave
> > differently in their planning / design / review process, so work on a
> > common need may require more of a negotiation than the prior
> > "unilateral" process.
> >
> > I think it says something for our communities that we would make a
> > commitment in our collaboration on this to the success of the
> > "consumer" projects. So if the Arrow or Parquet communities were
> > contemplating a change that might impact Kudu, for example, it would
> > be in our best interest to be careful and communicate proactively.
> >
> > This all makes sense. From an Arrow and Parquet perspective, we do not
> > add very much testing burden because our continuous integration suites
> > do not take long to run.
> >
> > 3) Pre-commit/test mechanics
> >
> > One thing that would help would be community-maintained
> > Dockerfiles/Docker images (or equivalent) to assist with validation
> > and testing for developers.
> >
> > I am happy to comply with a pre-commit testing protocol that works for
> > the Kudu and Impala teams.
> >
> > 4) Integration mechanics for breaking changes
> >
> >> One option is that each "user" of the libraries manually "rolls" to new
> versions when they feel like it, but there's still now a case where a
> common change "pushes work onto" the consumers to update call sites, etc.
> >
> > Breaking API changes will create extra work, because any automated
> > testing that we create will not be able to validate the patch to the
> > common library. Perhaps we can configure a manual way (in Jenkins,
> > say) to test two patches together.
> >
> > In the event that a community member has a patch containing an API
> > break that impacts a project that they are not a contributor for,
> > there should be some expectation to either work with the affected
> > project on a coordinated patch or obtain their +1 to merge the patch
> > even though it may require a follow-up patch if the roll-forward
> > in the consumer project exposes bugs in the common library. There may
> > be situations like:
> >
> > * Kudu changes API in $COMMON that impacts Arrow
> > * Arrow says +1, we will roll forward $COMMON later
> > * Patch merged
> > * Arrow rolls forward, discovers bug caused by patch in $COMMON
> > * Arrow proposes patch to $COMMON
> > * ...
> >
> > This is the worst case scenario, of course, but I actually think it is
> > good because it would indicate that the unit testing in $COMMON needs
> > to be improved. Unit testing in the common library, therefore, would
> > take on more of a "defensive" quality than currently.
> >
> > In any case, I'm keen to move forward to coming up with a concrete
> > plan if we can reach consensus on the particulars.
> >
> > Thanks
> > Wes
> >
> > On Sun, Feb 26, 2017 at 10:18 PM, Leif Walsh <le...@gmail.com>
> wrote:
> >> I also support the idea of creating an "apache commons modern c++" style
> >> library, maybe tailored toward the needs of columnar data processing
> >> tools.  I think APR is the wrong project but I think that *style* of
> >> project is the right direction to aim.
> >>
> >> I agree this adds test and release process complexity across products
> but I
> >> think the benefits of a shared, well-tested library outweigh that, and
> >> creating such test infrastructure will have long-term benefits as well.
> >>
> >> I'd be happy to lend a hand wherever it's needed.
> >>
> >> On Sun, Feb 26, 2017 at 4:03 PM Todd Lipcon <to...@cloudera.com> wrote:
> >>
> >>> Hey folks,
> >>>
> >>> As Henry mentioned, Impala is starting to share more code with Kudu
> (most
> >>> notably our RPC system, but that pulls in a fair bit of utility code as
> >>> well), so we've been chatting periodically offline about the best way
> to do
> >>> this. Having more projects potentially interested in collaborating is
> >>> definitely welcome, though I think it does also increase the complexity of
> >>> whatever solution we come up with.
> >>>
> >>> I think the potential benefits of collaboration are fairly
> self-evident, so
> >>> I'll focus on my concerns here, which somewhat echo Henry's.
> >>>
> >>> 1) Open source release model
> >>>
> >>> The ASF is very much against having projects which do not do releases.
> So,
> >>> if we were to create some new ASF project to hold this code, we'd be
> >>> expected to do frequent releases thereof. Wes volunteered above to lead
> >>> frequent releases, but we actually need at least 3 PMC members to vote
> on
> >>> each release, and given people can come and go, we'd probably need at
> least
> >>> 5-8 people who are actively committed to helping with the release
> process
> >>> of this "commons" project.
> >>>
> >>> Unlike our existing projects, which seem to release every 2-3 months,
> if
> >>> that, I think this one would have to release _much_ more frequently,
> if we
> >>> expect downstream projects to depend on released versions rather than
> just
> >>> pulling in some recent (or even trunk) git hash. Since the ASF
> requires the
> >>> normal voting period and process for every release, I don't think we
> could
> >>> do something like have "daily automatic releases", etc.
> >>>
> >>> We could probably campaign the ASF membership to treat this project
> >>> differently, either as (a) a repository of code that never releases, in
> >>> which case the "downstream" projects are responsible for vetting IP,
> etc,
> >>> as part of their own release processes, or (b) a project which does
> >>> automatic releases voted upon by robots. I'm guessing that (a) is more
> >>> palatable from an IP perspective, and also from the perspective of the
> >>> downstream projects.
> >>>
> >>>
> >>> 2) Governance/review model
> >>>
> >>> The more projects there are sharing this common code, the more
> difficult it
> >>> is to know whether a change would break something, or even whether a
> change
> >>> is considered desirable for all of the projects. I don't want to get
> into
> >>> some world where any change to a central library requires a multi-week
> >>> proposal/design-doc/review across 3+ different groups of committers,
> all of
> >>> whom may have different near-term priorities. On the other hand, it
> would
> >>> be pretty frustrating if the week before we're trying to cut a Kudu
> release
> >>> branch, someone in another community decides to make a potentially
> >>> destabilizing change to the RPC library.
> >>>
> >>>
> >>> 3) Pre-commit/test mechanics
> >>>
> >>> Semi-related to the above: we currently feel pretty confident when we
> make
> >>> a change to a central library like kudu/util/thread.cc that nothing
> broke
> >>> because we run the full suite of Kudu tests. Of course the central
> >>> libraries have some unit test coverage, but I wouldn't be confident
> with
> >>> any sort of model where shared code can change without verification by
> a
> >>> larger suite of tests.
> >>>
> >>> On the other hand, I also don't want to move to a model where any
> change to
> >>> shared code requires a 6+-hour precommit spanning several projects,
> each of
> >>> which may have its own set of potentially-flaky pre-commit tests, etc.
> I
> >>> can imagine that if an Arrow developer made some change to "thread.cc"
> and
> >>> saw that TabletServerStressTest failed their precommit, they'd have no
> idea
> >>> how to triage it, etc. That could be a strong disincentive to continued
> >>> innovation in these areas of common code, which we'll need a good way
> to
> >>> avoid.
> >>>
> >>> I think some of the above could be ameliorated with really good
> >>> infrastructure -- eg on a test failure, automatically re-run the failed
> >>> test on both pre-patch and post-patch, do a t-test to check statistical
> >>> significance in flakiness level, etc. But, that's a lot of
> infrastructure
> >>> that doesn't currently exist.
> >>>
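The statistical check described above could, for instance, use a two-sample t statistic over pre-patch and post-patch failure indicators for a test. A minimal sketch of Welch's t statistic follows; this is illustrative only and not tied to any existing CI tooling in these projects:

```cpp
#include <cmath>
#include <vector>

// Welch's t statistic for two samples, e.g. per-run failure indicators
// (0 = pass, 1 = fail) of a test before and after a patch. A |t| well
// above ~2 suggests the flakiness level genuinely changed rather than
// varied by chance.
double WelchT(const std::vector<double>& a, const std::vector<double>& b) {
  auto mean = [](const std::vector<double>& x) {
    double s = 0;
    for (double v : x) s += v;
    return s / x.size();
  };
  auto var = [](const std::vector<double>& x, double m) {
    double s = 0;
    for (double v : x) s += (v - m) * (v - m);
    return s / (x.size() - 1);  // unbiased sample variance
  };
  double ma = mean(a), mb = mean(b);
  double se = std::sqrt(var(a, ma) / a.size() + var(b, mb) / b.size());
  return (ma - mb) / se;
}
```

For example, ten pre-patch runs with one failure versus ten post-patch runs with eight failures yield |t| > 2, flagging a likely real regression rather than ordinary flakiness.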
> >>>
> >>> 4) Integration mechanics for breaking changes
> >>>
> >>> Currently these common libraries are treated as components of
> monolithic
> >>> projects. That means it's no extra overhead for us to make some kind of
> >>> change which breaks an API in src/kudu/util/ and at the same time
> updates
> >>> all call sites. The internal libraries have no semblance of API
> >>> compatibility guarantees, etc, and adding one is not without cost.
> >>>
> >>> Before sharing code, we should figure out how exactly we'll manage the
> >>> cases where we want to make some change in a common library that
> breaks an
> >>> API used by other projects, given there's no way to make an atomic
> commit
> >>> across many repositories. One option is that each "user" of the
> libraries
> >>> manually "rolls" to new versions when they feel like it, but there's
> still
> >>> now a case where a common change "pushes work onto" the consumers to
> update
> >>> call sites, etc.
> >>>
> >>> Admittedly, the number of breaking API changes in these common
> libraries is
> >>> relatively small, but it would still be good to understand how we would
> plan
> >>> to manage them.
> >>>
> >>> -Todd
> >>>
> >>> On Sun, Feb 26, 2017 at 10:12 AM, Wes McKinney <we...@gmail.com>
> >>> wrote:
> >>>
> >>>> hi Henry,
> >>>>
> >>>> Thank you for these comments.
> >>>>
> >>>> I think having a kind of "Apache Commons for [Modern] C++" would be an
> >>>> ideal (though perhaps initially more labor intensive) solution.
> >>>> There's code in Arrow that I would move into this project if it
> >>>> existed. I am happy to help make this happen if there is interest from
> >>>> the Kudu and Impala communities. I am not sure logistically what would
> >>>> be the most expedient way to establish the project, whether as an ASF
> >>>> Incubator project or possibly as a new TLP that could be created by
> >>>> spinning IP out of Apache Kudu.
> >>>>
> >>>> I'm interested to hear the opinions of others, and possible next
> steps.
> >>>>
> >>>> Thanks
> >>>> Wes
> >>>>
> >>>> On Sun, Feb 26, 2017 at 2:12 AM, Henry Robinson <he...@apache.org>
> >>> wrote:
> >>>>> Thanks for bringing this up, Wes.
> >>>>>
> >>>>> On 25 February 2017 at 14:18, Wes McKinney <we...@gmail.com>
> >>> wrote:
> >>>>>
> >>>>>> Dear Apache Kudu and Apache Impala (incubating) communities,
> >>>>>>
> >>>>>> (I'm not sure the best way to have a cross-list discussion, so I
> >>>>>> apologize if this does not work well)
> >>>>>>
> >>>>>> On the recent Apache Parquet sync call, we discussed C++ code
> sharing
> >>>>>> between the codebases in Apache Arrow and Apache Parquet, and
> >>>>>> opportunities for more code sharing with Kudu and Impala as well.
> >>>>>>
> >>>>>> As context
> >>>>>>
> >>>>>> * We have an RC out for the 1.0.0 release of apache-parquet-cpp, the
> >>>>>> first C++ release within Apache Parquet. I got involved with this
> >>>>>> project a little over a year ago and was faced with the unpleasant
> >>>>>> decision to copy and paste a significant amount of code out of
> >>>>>> Impala's codebase to bootstrap the project.
> >>>>>>
> >>>>>> * In parallel, we began the Apache Arrow project, which is designed
> to
> >>>>>> be a complementary library for file formats (like Parquet), storage
> >>>>>> engines (like Kudu), and compute engines (like Impala and pandas).
> >>>>>>
> >>>>>> * As Arrow and parquet-cpp matured, an increasing amount of code
> >>>>>> overlap crept up surrounding buffer memory management and IO
> >>>>>> interface. We recently decided in PARQUET-818
> >>>>>> (https://github.com/apache/parquet-cpp/commit/
> >>>>>> 2154e873d5aa7280314189a2683fb1e12a590c02)
> >>>>>> to remove some of the obvious code overlap in Parquet and make
> >>>>>> libarrow.a/so a hard compile and link-time dependency for
> >>>>>> libparquet.a/so.
> >>>>>>
> >>>>>> * There is still quite a bit of code in parquet-cpp that would
> better
> >>>>>> fit in Arrow: SIMD hash utilities, RLE encoding, dictionary
> encoding,
> >>>>>> compression, bit utilities, and so forth. Much of this code
> originated
> >>>>>> from Impala
> >>>>>>
> >>>>>> This brings me to a next set of points:
> >>>>>>
> >>>>>> * parquet-cpp contains quite a bit of code that was extracted from
> >>>>>> Impala. This is mostly self-contained in
> >>>>>> https://github.com/apache/parquet-cpp/tree/master/src/parquet/util
> >>>>>>
> >>>>>> * My understanding is that Kudu extracted certain computational
> >>>>>> utilities from Impala in its early days, but these tools have likely
> >>>>>> diverged as the needs of the projects have evolved.
> >>>>>>
> >>>>>> Since all of these projects are quite different in their end goals
> >>>>>> (runtime systems vs. libraries), touching code that is tightly
> coupled
> >>>>>> to either Kudu or Impala's runtimes is probably not worth
> discussing.
> >>>>>> However, I think there is a strong basis for collaboration on
> >>>>>> computational utilities and vectorized array processing. Some
> obvious
> >>>>>> areas that come to mind:
> >>>>>>
> >>>>>> * SIMD utilities (for hashing or processing of preallocated
> contiguous
> >>>>>> memory)
> >>>>>> * Array encoding utilities: RLE / Dictionary, etc.
> >>>>>> * Bit manipulation (packing and unpacking, e.g. Daniel Lemire
> >>>>>> contributed a patch to parquet-cpp around this)
> >>>>>> * Date and time utilities
> >>>>>> * Compression utilities
> >>>>>>
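The array encoding utilities listed above center on ideas like run-length encoding. As a purely illustrative sketch (the real parquet-cpp/Impala RleEncoder is a hybrid RLE/bit-packed codec operating on bit-level buffers, which this toy version does not attempt to reproduce):

```cpp
#include <cstdint>
#include <utility>
#include <vector>

// Encode a sequence of values as (value, run_length) pairs.
std::vector<std::pair<int32_t, uint32_t>> RleEncode(
    const std::vector<int32_t>& values) {
  std::vector<std::pair<int32_t, uint32_t>> runs;
  for (int32_t v : values) {
    if (!runs.empty() && runs.back().first == v) {
      ++runs.back().second;  // extend the current run
    } else {
      runs.emplace_back(v, 1);  // start a new run
    }
  }
  return runs;
}

// Expand (value, run_length) pairs back into the original sequence.
std::vector<int32_t> RleDecode(
    const std::vector<std::pair<int32_t, uint32_t>>& runs) {
  std::vector<int32_t> values;
  for (const auto& run : runs) {
    values.insert(values.end(), run.second, run.first);
  }
  return values;
}
```

Decoding an encoding round-trips the input, and long runs of repeated values (common in dictionary-encoded columns) collapse to a single pair, which is why such utilities recur across all four codebases.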
> >>>>>
> >>>>> Between Kudu and Impala (at least) there are many more opportunities
> >>> for
> >>>>> sharing. Threads, logging, metrics, concurrent primitives - the list
> is
> >>>>> quite long.
> >>>>>
> >>>>>
> >>>>>>
> >>>>>> I hope the benefits are obvious: consolidating efforts on unit
> >>>>>> testing, benchmarking, performance optimizations, continuous
> >>>>>> integration, and platform compatibility.
> >>>>>>
> >>>>>> Logistically speaking, one possible avenue might be to use Apache
> >>>>>> Arrow as the place to assemble this code. Its thirdparty toolchain
> is
> >>>>>> small, and it builds and installs fast. It is intended as a library
> to
> >>>>>> have its headers used and linked against other applications. (As an
> >>>>>> aside, I'm very interested in building optional support for Arrow
> >>>>>> columnar messages into the kudu client).
> >>>>>>
> >>>>>
> >>>>> In principle I'm in favour of code sharing, and it seems very much in
> >>>>> keeping with the Apache way. However, practically speaking I'm of the
> >>>>> opinion that it only makes sense to house shared support code in a
> >>>>> separate, dedicated project.
> >>>>>
> >>>>> Embedding the shared libraries in, e.g., Arrow naturally limits the
> >>> scope
> >>>>> of sharing to utilities that Arrow is interested in. It would make no
> >>>> sense
> >>>>> to add a threading library to Arrow if it was never used natively.
> >>>> Muddying
> >>>>> the waters of the project's charter seems likely to lead to user, and
> >>>>> developer, confusion. Similarly, we should not necessarily couple
> >>> Arrow's
> >>>>> design goals to those it inherits from Kudu and Impala's source code.
> >>>>>
> >>>>> I think I'd rather see a new Apache project than re-use a current one
> >>> for
> >>>>> two independent purposes.
> >>>>>
> >>>>>
> >>>>>>
> >>>>>> The downsides of code sharing, which may have prevented it so far,
> are
> >>>>>> the logistics of coordinating ASF release cycles and keeping build
> >>>>>> toolchains in sync. It's taken us the past year to stabilize the
> >>>>>> design of Arrow for its intended use cases, so at this point if we
> >>>>>> went down this road I would be OK with helping the community commit
> to
> >>>>>> a regular release cadence that would be faster than Impala, Kudu,
> and
> >>>>>> Parquet's respective release cadences. Since members of the Kudu and
> >>>>>> Impala PMC are also on the Arrow PMC, I trust we would be able to
> >>>>>> collaborate to each other's mutual benefit and success.
> >>>>>>
> >>>>>> Note that Arrow does not throw C++ exceptions and similarly follows
> >>>>>> the Google C++ style guide to the same extent as Kudu and Impala.
> >>>>>>
> >>>>>> If this is something that either the Kudu or Impala communities
> would
> >>>>>> like to pursue in earnest, I would be happy to work with you on next
> >>>>>> steps. I would suggest that we start with something small so that we
> >>>>>> could address the necessary build toolchain changes, and develop a
> >>>>>> workflow for moving around code and tests, a protocol for code
> reviews
> >>>>>> (e.g. Gerrit), and coordinating ASF releases.
> >>>>>>
> >>>>>
> >>>>> I think, if I'm reading this correctly, that you're assuming
> >>> integration
> >>>>> with the 'downstream' projects (e.g. Impala and Kudu) would be done
> via
> >>>>> their toolchains. For something as fast moving as utility code - and
> >>>>> critical, where you want the latency between adding a fix and
> including
> >>>> it
> >>>>> in your build to be ~0 - that's a non-starter to me, at least with
> how
> >>>> the
> >>>>> toolchains are currently realised.
> >>>>>
> >>>>> I'd rather have the source code directly imported into Impala's tree
> -
> >>>>> whether by git submodule or other mechanism. That way the coupling is
> >>>>> looser, and we can move more quickly. I think that's important to
> other
> >>>>> projects as well.
> >>>>>
> >>>>> Henry
> >>>>>
> >>>>>
> >>>>>
> >>>>>>
> >>>>>> Let me know what you think.
> >>>>>>
> >>>>>> best
> >>>>>> Wes
> >>>>>>
> >>>>
> >>>
> >>>
> >>>
> >>> --
> >>> Todd Lipcon
> >>> Software Engineer, Cloudera
> >>>
> >> --
> >> --
> >> Cheers,
> >> Leif
>
> --
-- 
Cheers,
Leif

Re: [DISCUSS] C++ code sharing amongst Apache {Arrow, Kudu, Impala, Parquet}

Posted by Julian Hyde <jh...@apache.org>.
“Commons” projects are often problematic. It is difficult to tell what is in scope and out of scope. If the scope is drawn too wide, there is a real problem of orphaned features, because people contribute one feature and then disappear.

Let’s remember the Apache mantra: community over code. If you create a sustainable community, the code will get looked after. Would this project form a new community, or just a new piece of code? As I read the current proposal, it would be the intersection of some existing communities, not a new community.

I think it would take a considerable effort to create a new project and community around the idea of “c++ commons” (or is it “database-related c++ commons”?). I think you already have such a community, to a first approximation, in the Arrow project, because Kudu and Impala developers are already part of the Arrow community. There’s no reason why Arrow cannot contain new modules that have different release schedules than the rest of Arrow. As a TLP, releases are less burdensome, and can happen in a little over 3 days if the component is kept stable.

Lastly, the code is fungible. It can be marked “experimental” within Arrow and moved to another project, or into a new project, as it matures. The Apache license and the ASF CLA makes this very easy. We are doing something like this in Calcite: the Avatica sub-project [1] has a community that intersects with Calcite’s, is disconnected at a code level, and may over time evolve into a separate project. In the meantime, being part of an established project is helpful, because there are PMC members to vote.

Julian

[1] https://calcite.apache.org/avatica/

> On Feb 27, 2017, at 6:41 AM, Wes McKinney <we...@gmail.com> wrote:
> 
> Responding to Todd's e-mail:
> 
> 1) Open source release model
> 
> My expectation is that this library would release about once a month,
> with occasional faster releases for critical fixes.
> 
> 2) Governance/review model
> 
> Beyond having centralized code reviews, it's hard to predict how the
> governance would play out. I understand that OSS projects behave
> differently in their planning / design / review process, so work on a
> common need may require more of a negotiation than the prior
> "unilateral" process.
> 
> I think it says something for our communities that we would make a
> commitment in our collaboration on this to the success of the
> "consumer" projects. So if the Arrow or Parquet communities were
> contemplating a change that might impact Kudu, for example, it would
> be in our best interest to be careful and communicate proactively.
> 
> This all makes sense. From an Arrow and Parquet perspective, we do not
> add very much testing burden because our continuous integration suites
> do not take long to run.
> 
> 3) Pre-commit/test mechanics
> 
> One thing that would help would be community-maintained
> Dockerfiles/Docker images (or equivalent) to assist with validation
> and testing for developers.
> 
> I am happy to comply with a pre-commit testing protocol that works for
> the Kudu and Impala teams.
> 
> 4) Integration mechanics for breaking changes
> 
>> One option is that each "user" of the libraries manually "rolls" to new versions when they feel like it, but there's still now a case where a common change "pushes work onto" the consumers to update call sites, etc.
> 
> Breaking API changes will create extra work, because any automated
> testing that we create will not be able to validate the patch to the
> common library. Perhaps we can configure a manual way (in Jenkins,
> say) to test two patches together.
> 
> In the event that a community member has a patch containing an API
> break that impacts a project that they are not a contributor for,
> there should be some expectation to either work with the affected
> project on a coordinated patch or obtain their +1 to merge the patch
> even though it may require a follow-up patch if the roll-forward
> in the consumer project exposes bugs in the common library. There may
> be situations like:
> 
> * Kudu changes API in $COMMON that impacts Arrow
> * Arrow says +1, we will roll forward $COMMON later
> * Patch merged
> * Arrow rolls forward, discovers bug caused by patch in $COMMON
> * Arrow proposes patch to $COMMON
> * ...
> 
> This is the worst case scenario, of course, but I actually think it is
> good because it would indicate that the unit testing in $COMMON needs
> to be improved. Unit testing in the common library, therefore, would
> take on more of a "defensive" quality than currently.
> 
> In any case, I'm keen to move forward to coming up with a concrete
> plan if we can reach consensus on the particulars.
> 
> Thanks
> Wes
> 
> On Sun, Feb 26, 2017 at 10:18 PM, Leif Walsh <le...@gmail.com> wrote:
>> I also support the idea of creating an "apache commons modern c++" style
>> library, maybe tailored toward the needs of columnar data processing
>> tools.  I think APR is the wrong project but I think that *style* of
>> project is the right direction to aim.
>> 
>> I agree this adds test and release process complexity across products but I
>> think the benefits of a shared, well-tested library outweigh that, and
>> creating such test infrastructure will have long-term benefits as well.
>> 
>> I'd be happy to lend a hand wherever it's needed.
>> 
>> On Sun, Feb 26, 2017 at 4:03 PM Todd Lipcon <to...@cloudera.com> wrote:
>> 
>>> Hey folks,
>>> 
>>> As Henry mentioned, Impala is starting to share more code with Kudu (most
>>> notably our RPC system, but that pulls in a fair bit of utility code as
>>> well), so we've been chatting periodically offline about the best way to do
>>> this. Having more projects potentially interested in collaborating is
>>> definitely welcome, though I think it does also increase the complexity of
>>> whatever solution we come up with.
>>> 
>>> I think the potential benefits of collaboration are fairly self-evident, so
>>> I'll focus on my concerns here, which somewhat echo Henry's.
>>> 
>>> 1) Open source release model
>>> 
>>> The ASF is very much against having projects which do not do releases. So,
>>> if we were to create some new ASF project to hold this code, we'd be
>>> expected to do frequent releases thereof. Wes volunteered above to lead
>>> frequent releases, but we actually need at least 3 PMC members to vote on
>>> each release, and given people can come and go, we'd probably need at least
>>> 5-8 people who are actively committed to helping with the release process
>>> of this "commons" project.
>>> 
>>> Unlike our existing projects, which seem to release every 2-3 months, if
>>> that, I think this one would have to release _much_ more frequently, if we
>>> expect downstream projects to depend on released versions rather than just
>>> pulling in some recent (or even trunk) git hash. Since the ASF requires the
>>> normal voting period and process for every release, I don't think we could
>>> do something like have "daily automatic releases", etc.
>>> 
>>> We could probably campaign the ASF membership to treat this project
>>> differently, either as (a) a repository of code that never releases, in
>>> which case the "downstream" projects are responsible for vetting IP, etc,
>>> as part of their own release processes, or (b) a project which does
>>> automatic releases voted upon by robots. I'm guessing that (a) is more
>>> palatable from an IP perspective, and also from the perspective of the
>>> downstream projects.
>>> 
>>> 
>>> 2) Governance/review model
>>> 
>>> The more projects there are sharing this common code, the more difficult it
>>> is to know whether a change would break something, or even whether a change
>>> is considered desirable for all of the projects. I don't want to get into
>>> some world where any change to a central library requires a multi-week
>>> proposal/design-doc/review across 3+ different groups of committers, all of
>>> whom may have different near-term priorities. On the other hand, it would
>>> be pretty frustrating if the week before we're trying to cut a Kudu release
>>> branch, someone in another community decides to make a potentially
>>> destabilizing change to the RPC library.
>>> 
>>> 
>>> 3) Pre-commit/test mechanics
>>> 
>>> Semi-related to the above: we currently feel pretty confident when we make
>>> a change to a central library like kudu/util/thread.cc that nothing broke
>>> because we run the full suite of Kudu tests. Of course the central
>>> libraries have some unit test coverage, but I wouldn't be confident with
>>> any sort of model where shared code can change without verification by a
>>> larger suite of tests.
>>> 
>>> On the other hand, I also don't want to move to a model where any change to
>>> shared code requires a 6+-hour precommit spanning several projects, each of
>>> which may have its own set of potentially-flaky pre-commit tests, etc. I
>>> can imagine that if an Arrow developer made some change to "thread.cc" and
>>> saw that TabletServerStressTest failed their precommit, they'd have no idea
>>> how to triage it, etc. That could be a strong disincentive to continued
>>> innovation in these areas of common code, which we'll need a good way to
>>> avoid.
>>> 
>>> I think some of the above could be ameliorated with really good
>>> infrastructure -- eg on a test failure, automatically re-run the failed
>>> test on both pre-patch and post-patch, do a t-test to check statistical
>>> significance in flakiness level, etc. But, that's a lot of infrastructure
>>> that doesn't currently exist.
>>> 
>>> 
>>> 4) Integration mechanics for breaking changes
>>> 
>>> Currently these common libraries are treated as components of monolithic
>>> projects. That means it's no extra overhead for us to make some kind of
>>> change which breaks an API in src/kudu/util/ and at the same time updates
>>> all call sites. The internal libraries have no semblance of API
>>> compatibility guarantees, etc, and adding one is not without cost.
>>> 
>>> Before sharing code, we should figure out how exactly we'll manage the
>>> cases where we want to make some change in a common library that breaks an
>>> API used by other projects, given there's no way to make an atomic commit
>>> across many repositories. One option is that each "user" of the libraries
>>> manually "rolls" to new versions when they feel like it, but there's still
>>> now a case where a common change "pushes work onto" the consumers to update
>>> call sites, etc.
>>> 
>>> Admittedly, the number of breaking API changes in these common libraries is
>>> relatively small, but it would still be good to understand how we would plan
>>> to manage them.
>>> 
>>> -Todd
>>> 
>>> On Sun, Feb 26, 2017 at 10:12 AM, Wes McKinney <we...@gmail.com>
>>> wrote:
>>> 
>>>> hi Henry,
>>>> 
>>>> Thank you for these comments.
>>>> 
>>>> I think having a kind of "Apache Commons for [Modern] C++" would be an
>>>> ideal (though perhaps initially more labor intensive) solution.
>>>> There's code in Arrow that I would move into this project if it
>>>> existed. I am happy to help make this happen if there is interest from
>>>> the Kudu and Impala communities. I am not sure logistically what would
>>>> be the most expedient way to establish the project, whether as an ASF
>>>> Incubator project or possibly as a new TLP that could be created by
>>>> spinning IP out of Apache Kudu.
>>>> 
>>>> I'm interested to hear the opinions of others, and possible next steps.
>>>> 
>>>> Thanks
>>>> Wes
>>>> 
>>>> On Sun, Feb 26, 2017 at 2:12 AM, Henry Robinson <he...@apache.org>
>>> wrote:
>>>>> Thanks for bringing this up, Wes.
>>>>> 
>>>>> On 25 February 2017 at 14:18, Wes McKinney <we...@gmail.com>
>>> wrote:
>>>>> 
>>>>>> Dear Apache Kudu and Apache Impala (incubating) communities,
>>>>>> 
>>>>>> (I'm not sure the best way to have a cross-list discussion, so I
>>>>>> apologize if this does not work well)
>>>>>> 
>>>>>> On the recent Apache Parquet sync call, we discussed C++ code sharing
>>>>>> between the codebases in Apache Arrow and Apache Parquet, and
>>>>>> opportunities for more code sharing with Kudu and Impala as well.
>>>>>> 
>>>>>> As context
>>>>>> 
>>>>>> * We have an RC out for the 1.0.0 release of apache-parquet-cpp, the
>>>>>> first C++ release within Apache Parquet. I got involved with this
>>>>>> project a little over a year ago and was faced with the unpleasant
>>>>>> decision to copy and paste a significant amount of code out of
>>>>>> Impala's codebase to bootstrap the project.
>>>>>> 
>>>>>> * In parallel, we began the Apache Arrow project, which is designed to
>>>>>> be a complementary library for file formats (like Parquet), storage
>>>>>> engines (like Kudu), and compute engines (like Impala and pandas).
>>>>>> 
>>>>>> * As Arrow and parquet-cpp matured, an increasing amount of code
>>>>>> overlap crept up surrounding buffer memory management and IO
>>>>>> interface. We recently decided in PARQUET-818
>>>>>> (https://github.com/apache/parquet-cpp/commit/
>>>>>> 2154e873d5aa7280314189a2683fb1e12a590c02)
>>>>>> to remove some of the obvious code overlap in Parquet and make
>>>>>> libarrow.a/so a hard compile and link-time dependency for
>>>>>> libparquet.a/so.
>>>>>> 
>>>>>> * There is still quite a bit of code in parquet-cpp that would better
>>>>>> fit in Arrow: SIMD hash utilities, RLE encoding, dictionary encoding,
>>>>>> compression, bit utilities, and so forth. Much of this code originated
>>>>>> from Impala
>>>>>> 
>>>>>> This brings me to a next set of points:
>>>>>> 
>>>>>> * parquet-cpp contains quite a bit of code that was extracted from
>>>>>> Impala. This is mostly self-contained in
>>>>>> https://github.com/apache/parquet-cpp/tree/master/src/parquet/util
>>>>>> 
>>>>>> * My understanding is that Kudu extracted certain computational
>>>>>> utilities from Impala in its early days, but these tools have likely
>>>>>> diverged as the needs of the projects have evolved.
>>>>>> 
>>>>>> Since all of these projects are quite different in their end goals
>>>>>> (runtime systems vs. libraries), touching code that is tightly coupled
>>>>>> to either Kudu or Impala's runtimes is probably not worth discussing.
>>>>>> However, I think there is a strong basis for collaboration on
>>>>>> computational utilities and vectorized array processing. Some obvious
>>>>>> areas that come to mind:
>>>>>> 
>>>>>> * SIMD utilities (for hashing or processing of preallocated contiguous
>>>>>> memory)
>>>>>> * Array encoding utilities: RLE / Dictionary, etc.
>>>>>> * Bit manipulation (packing and unpacking, e.g. Daniel Lemire
>>>>>> contributed a patch to parquet-cpp around this)
>>>>>> * Date and time utilities
>>>>>> * Compression utilities
>>>>>> 
>>>>> 
>>>>> Between Kudu and Impala (at least) there are many more opportunities
>>> for
>>>>> sharing. Threads, logging, metrics, concurrent primitives - the list is
>>>>> quite long.
>>>>> 
>>>>> 
>>>>>> 
>>>>>> I hope the benefits are obvious: consolidating efforts on unit
>>>>>> testing, benchmarking, performance optimizations, continuous
>>>>>> integration, and platform compatibility.
>>>>>> 
>>>>>> Logistically speaking, one possible avenue might be to use Apache
>>>>>> Arrow as the place to assemble this code. Its thirdparty toolchain is
>>>>>> small, and it builds and installs fast. It is intended as a library to
>>>>>> have its headers used and linked against other applications. (As an
>>>>>> aside, I'm very interested in building optional support for Arrow
>>>>>> columnar messages into the kudu client).
>>>>>> 
>>>>> 
>>>>> In principle I'm in favour of code sharing, and it seems very much in
>>>>> keeping with the Apache way. However, practically speaking I'm of the
>>>>> opinion that it only makes sense to house shared support code in a
>>>>> separate, dedicated project.
>>>>> 
>>>>> Embedding the shared libraries in, e.g., Arrow naturally limits the
>>> scope
>>>>> of sharing to utilities that Arrow is interested in. It would make no
>>>> sense
>>>>> to add a threading library to Arrow if it was never used natively.
>>>> Muddying
>>>>> the waters of the project's charter seems likely to lead to user, and
>>>>> developer, confusion. Similarly, we should not necessarily couple
>>> Arrow's
>>>>> design goals to those it inherits from Kudu and Impala's source code.
>>>>> 
>>>>> I think I'd rather see a new Apache project than re-use a current one
>>> for
>>>>> two independent purposes.
>>>>> 
>>>>> 
>>>>>> 
>>>>>> The downsides of code sharing, which may have prevented it so far, are
>>>>>> the logistics of coordinating ASF release cycles and keeping build
>>>>>> toolchains in sync. It's taken us the past year to stabilize the
>>>>>> design of Arrow for its intended use cases, so at this point if we
>>>>>> went down this road I would be OK with helping the community commit to
>>>>>> a regular release cadence that would be faster than Impala, Kudu, and
>>>>>> Parquet's respective release cadences. Since members of the Kudu and
>>>>>> Impala PMC are also on the Arrow PMC, I trust we would be able to
>>>>>> collaborate to each other's mutual benefit and success.
>>>>>> 
>>>>>> Note that Arrow does not throw C++ exceptions and similarly follows the
>>>>>> Google C++ style guide to the same extent as Kudu and Impala.
>>>>>> 
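[Editorial note: for readers unfamiliar with the no-exceptions convention mentioned above, these codebases report errors through a Status-like return value instead of throwing. The sketch below is a minimal illustration under that convention; the real arrow::Status and kudu::Status classes carry error codes and richer state, and ParseWidth is a hypothetical caller invented for this example.]

```cpp
#include <string>
#include <utility>

// Minimal Status-style error type, as used in codebases that forbid
// C++ exceptions. Illustrative only; not the actual Arrow/Kudu API.
class Status {
 public:
  static Status OK() { return Status(); }
  static Status Invalid(std::string msg) { return Status(std::move(msg)); }
  bool ok() const { return msg_.empty(); }
  const std::string& message() const { return msg_; }

 private:
  Status() = default;
  explicit Status(std::string msg) : msg_(std::move(msg)) {}
  std::string msg_;
};

// Callees report failure through the return value, not by throwing.
Status ParseWidth(int width, int* out) {
  if (width <= 0) return Status::Invalid("width must be positive");
  *out = width;
  return Status::OK();
}
```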
>>>>>> If this is something that either the Kudu or Impala communities would
>>>>>> like to pursue in earnest, I would be happy to work with you on next
>>>>>> steps. I would suggest that we start with something small so that we
>>>>>> could address the necessary build toolchain changes, and develop a
>>>>>> workflow for moving around code and tests, a protocol for code reviews
>>>>>> (e.g. Gerrit), and coordinating ASF releases.
>>>>>> 
>>>>> 
>>>>> I think, if I'm reading this correctly, that you're assuming
>>> integration
>>>>> with the 'downstream' projects (e.g. Impala and Kudu) would be done via
>>>>> their toolchains. For something as fast moving as utility code - and
>>>>> critical, where you want the latency between adding a fix and including
>>>> it
>>>>> in your build to be ~0 - that's a non-starter to me, at least with how
>>>> the
>>>>> toolchains are currently realised.
>>>>> 
>>>>> I'd rather have the source code directly imported into Impala's tree -
>>>>> whether by git submodule or other mechanism. That way the coupling is
>>>>> looser, and we can move more quickly. I think that's important to other
>>>>> projects as well.
>>>>> 
>>>>> Henry
>>>>> 
>>>>> 
>>>>> 
>>>>>> 
>>>>>> Let me know what you think.
>>>>>> 
>>>>>> best
>>>>>> Wes
>>>>>> 
>>>> 
>>> 
>>> 
>>> 
>>> --
>>> Todd Lipcon
>>> Software Engineer, Cloudera
>>> 
>> --
>> --
>> Cheers,
>> Leif


Re: [DISCUSS] C++ code sharing amongst Apache {Arrow, Kudu, Impala, Parquet}

Posted by Julian Hyde <jh...@apache.org>.
“Commons” projects are often problematic. It is difficult to tell what is in scope and out of scope. If the scope is drawn too wide, there is a real problem of orphaned features, because people contribute one feature and then disappear.

Let’s remember the Apache mantra: community over code. If you create a sustainable community, the code will get looked after. Would this project form a new community, or just a new piece of code? As I read the current proposal, it would be the intersection of some existing communities, not a new community.

I think it would take a considerable effort to create a new project and community around the idea of “c++ commons” (or is it “database-related c++ commons”?). I think you already have such a community, to a first approximation, in the Arrow project, because Kudu and Impala developers are already part of the Arrow community. There’s no reason why Arrow cannot contain new modules that have different release schedules than the rest of Arrow. As a TLP, releases are less burdensome, and can happen in a little over 3 days if the component is kept stable.

Lastly, the code is fungible. It can be marked “experimental” within Arrow and moved to another project, or into a new project, as it matures. The Apache license and the ASF CLA make this very easy. We are doing something like this in Calcite: the Avatica sub-project [1] has a community that intersects with Calcite’s, is disconnected at a code level, and may over time evolve into a separate project. In the meantime, being part of an established project is helpful, because there are PMC members to vote.

Julian

[1] https://calcite.apache.org/avatica/ <https://calcite.apache.org/avatica/>

> On Feb 27, 2017, at 6:41 AM, Wes McKinney <we...@gmail.com> wrote:
> 
> Responding to Todd's e-mail:
> 
> 1) Open source release model
> 
> My expectation is that this library would release about once a month,
> with occasional faster releases for critical fixes.
> 
> 2) Governance/review model
> 
> Beyond having centralized code reviews, it's hard to predict how the
> governance would play out. I understand that OSS projects behave
> differently in their planning / design / review process, so work on a
> common need may require more of a negotiation than the prior
> "unilateral" process.
> 
> I think it says something for our communities that we would make a
> commitment in our collaboration on this to the success of the
> "consumer" projects. So if the Arrow or Parquet communities were
> contemplating a change that might impact Kudu, for example, it would
> be in our best interest to be careful and communicate proactively.
> 
> This all makes sense. From an Arrow and Parquet perspective, we do not
> add very much testing burden because our continuous integration suites
> do not take long to run.
> 
> 3) Pre-commit/test mechanics
> 
> One thing that would help would be community-maintained
> Dockerfiles/Docker images (or equivalent) to assist with validation
> and testing for developers.
> 
> I am happy to comply with a pre-commit testing protocol that works for
> the Kudu and Impala teams.
> 
> 4) Integration mechanics for breaking changes
> 
>> One option is that each "user" of the libraries manually "rolls" to new versions when they feel like it, but there's still now a case where a common change "pushes work onto" the consumers to update call sites, etc.
> 
> Breaking API changes will create extra work, because any automated
> testing that we create will not be able to validate the patch to the
> common library. Perhaps we can configure a manual way (in Jenkins,
> say) to test two patches together.
> 
> In the event that a community member has a patch containing an API
> break that impacts a project that they are not a contributor for,
> there should be some expectation to either work with the affected
> project on a coordinated patch or obtain their +1 to merge the patch
> even though it may require a follow-up patch if the roll-forward
> in the consumer project exposes bugs in the common library. There may
> be situations like:
> 
> * Kudu changes API in $COMMON that impacts Arrow
> * Arrow says +1, we will roll forward $COMMON later
> * Patch merged
> * Arrow rolls forward, discovers bug caused by patch in $COMMON
> * Arrow proposes patch to $COMMON
> * ...
> 
> This is the worst case scenario, of course, but I actually think it is
> good because it would indicate that the unit testing in $COMMON needs
> to be improved. Unit testing in the common library, therefore, would
> take on more of a "defensive" quality than currently.
> 
> In any case, I'm keen to move forward to coming up with a concrete
> plan if we can reach consensus on the particulars.
> 
> Thanks
> Wes
> 
> On Sun, Feb 26, 2017 at 10:18 PM, Leif Walsh <le...@gmail.com> wrote:
>> I also support the idea of creating an "apache commons modern c++" style
>> library, maybe tailored toward the needs of columnar data processing
>> tools.  I think APR is the wrong project but I think that *style* of
>> project is the right direction to aim.
>> 
>> I agree this adds test and release process complexity across products but I
>> think the benefits of a shared, well-tested library outweigh that, and
>> creating such test infrastructure will have long-term benefits as well.
>> 
>> I'd be happy to lend a hand wherever it's needed.
>> 
>> On Sun, Feb 26, 2017 at 4:03 PM Todd Lipcon <to...@cloudera.com> wrote:
>> 
>>> Hey folks,
>>> 
>>> As Henry mentioned, Impala is starting to share more code with Kudu (most
>>> notably our RPC system, but that pulls in a fair bit of utility code as
>>> well), so we've been chatting periodically offline about the best way to do
>>> this. Having more projects potentially interested in collaborating is
>>> definitely welcome, though I think it also increases the complexity of
>>> whatever solution we come up with.
>>> 
>>> I think the potential benefits of collaboration are fairly self-evident, so
>>> I'll focus on my concerns here, which somewhat echo Henry's.
>>> 
>>> 1) Open source release model
>>> 
>>> The ASF is very much against having projects which do not do releases. So,
>>> if we were to create some new ASF project to hold this code, we'd be
>>> expected to do frequent releases thereof. Wes volunteered above to lead
>>> frequent releases, but we actually need at least 3 PMC members to vote on
>>> each release, and given people can come and go, we'd probably need at least
>>> 5-8 people who are actively committed to helping with the release process
>>> of this "commons" project.
>>> 
>>> Unlike our existing projects, which seem to release every 2-3 months, if
>>> that, I think this one would have to release _much_ more frequently, if we
>>> expect downstream projects to depend on released versions rather than just
>>> pulling in some recent (or even trunk) git hash. Since the ASF requires the
>>> normal voting period and process for every release, I don't think we could
>>> do something like have "daily automatic releases", etc.
>>> 
>>> We could probably campaign the ASF membership to treat this project
>>> differently, either as (a) a repository of code that never releases, in
>>> which case the "downstream" projects are responsible for vetting IP, etc,
>>> as part of their own release processes, or (b) a project which does
>>> automatic releases voted upon by robots. I'm guessing that (a) is more
>>> palatable from an IP perspective, and also from the perspective of the
>>> downstream projects.
>>> 
>>> 
>>> 2) Governance/review model
>>> 
>>> The more projects there are sharing this common code, the more difficult it
>>> is to know whether a change would break something, or even whether a change
>>> is considered desirable for all of the projects. I don't want to get into
>>> some world where any change to a central library requires a multi-week
>>> proposal/design-doc/review across 3+ different groups of committers, all of
>>> whom may have different near-term priorities. On the other hand, it would
>>> be pretty frustrating if the week before we're trying to cut a Kudu release
>>> branch, someone in another community decides to make a potentially
>>> destabilizing change to the RPC library.
>>> 
>>> 
>>> 3) Pre-commit/test mechanics
>>> 
>>> Semi-related to the above: we currently feel pretty confident when we make
>>> a change to a central library like kudu/util/thread.cc that nothing broke
>>> because we run the full suite of Kudu tests. Of course the central
>>> libraries have some unit test coverage, but I wouldn't be confident with
>>> any sort of model where shared code can change without verification by a
>>> larger suite of tests.
>>> 
>>> On the other hand, I also don't want to move to a model where any change to
>>> shared code requires a 6+-hour precommit spanning several projects, each of
>>> which may have its own set of potentially-flaky pre-commit tests, etc. I
>>> can imagine that if an Arrow developer made some change to "thread.cc" and
>>> saw that TabletServerStressTest failed their precommit, they'd have no idea
>>> how to triage it, etc. That could be a strong disincentive to continued
>>> innovation in these areas of common code, which we'll need a good way to
>>> avoid.
>>> 
>>> I think some of the above could be ameliorated with really good
>>> infrastructure -- eg on a test failure, automatically re-run the failed
>>> test on both pre-patch and post-patch, do a t-test to check statistical
>>> significance in flakiness level, etc. But, that's a lot of infrastructure
>>> that doesn't currently exist.
>>> 
>>> 
>>> 4) Integration mechanics for breaking changes
>>> 
>>> Currently these common libraries are treated as components of monolithic
>>> projects. That means it's no extra overhead for us to make some kind of
>>> change which breaks an API in src/kudu/util/ and at the same time updates
>>> all call sites. The internal libraries have no semblance of API
>>> compatibility guarantees, etc, and adding one is not without cost.
>>> 
>>> Before sharing code, we should figure out how exactly we'll manage the
>>> cases where we want to make some change in a common library that breaks an
>>> API used by other projects, given there's no way to make an atomic commit
>>> across many repositories. One option is that each "user" of the libraries
>>> manually "rolls" to new versions when they feel like it, but there's still
>>> now a case where a common change "pushes work onto" the consumers to update
>>> call sites, etc.
>>> 
>>> Admittedly, the number of breaking API changes in these common libraries is
>>> relatively small, but would still be good to understand how we would plan
>>> to manage them.
>>> 
>>> -Todd
>>> 
>>> On Sun, Feb 26, 2017 at 10:12 AM, Wes McKinney <we...@gmail.com>
>>> wrote:
>>> 
>>>> hi Henry,
>>>> 
>>>> Thank you for these comments.
>>>> 
>>>> I think having a kind of "Apache Commons for [Modern] C++" would be an
>>>> ideal (though perhaps initially more labor intensive) solution.
>>>> There's code in Arrow that I would move into this project if it
>>>> existed. I am happy to help make this happen if there is interest from
>>>> the Kudu and Impala communities. I am not sure logistically what would
>>>> be the most expedient way to establish the project, whether as an ASF
>>>> Incubator project or possibly as a new TLP that could be created by
>>>> spinning IP out of Apache Kudu.
>>>> 
>>>> I'm interested to hear the opinions of others, and possible next steps.
>>>> 
>>>> Thanks
>>>> Wes
>>>> 
>>>> On Sun, Feb 26, 2017 at 2:12 AM, Henry Robinson <he...@apache.org>
>>> wrote:
>>>>> Thanks for bringing this up, Wes.
>>>>> 
>>>>> On 25 February 2017 at 14:18, Wes McKinney <we...@gmail.com>
>>> wrote:
>>>>> 
>>>>>> Dear Apache Kudu and Apache Impala (incubating) communities,
>>>>>> 
>>>>>> (I'm not sure the best way to have a cross-list discussion, so I
>>>>>> apologize if this does not work well)
>>>>>> 
>>>>>> On the recent Apache Parquet sync call, we discussed C++ code sharing
>>>>>> between the codebases in Apache Arrow and Apache Parquet, and
>>>>>> opportunities for more code sharing with Kudu and Impala as well.
>>>>>> 
>>>>>> As context
>>>>>> 
>>>>>> * We have an RC out for the 1.0.0 release of apache-parquet-cpp, the
>>>>>> first C++ release within Apache Parquet. I got involved with this
>>>>>> project a little over a year ago and was faced with the unpleasant
>>>>>> decision to copy and paste a significant amount of code out of
>>>>>> Impala's codebase to bootstrap the project.
>>>>>> 
>>>>>> * In parallel, we began the Apache Arrow project, which is designed to
>>>>>> be a complementary library for file formats (like Parquet), storage
>>>>>> engines (like Kudu), and compute engines (like Impala and pandas).
>>>>>> 
>>>>>> * As Arrow and parquet-cpp matured, an increasing amount of code
>>>>>> overlap crept up surrounding buffer memory management and IO
>>>>>> interface. We recently decided in PARQUET-818
>>>>>> (https://github.com/apache/parquet-cpp/commit/
>>>>>> 2154e873d5aa7280314189a2683fb1e12a590c02)
>>>>>> to remove some of the obvious code overlap in Parquet and make
>>>>>> libarrow.a/so a hard compile and link-time dependency for
>>>>>> libparquet.a/so.
>>>>>> 
>>>>>> * There is still quite a bit of code in parquet-cpp that would better
>>>>>> fit in Arrow: SIMD hash utilities, RLE encoding, dictionary encoding,
>>>>>> compression, bit utilities, and so forth. Much of this code originated
>>>>>> from Impala
>>>>>> 
>>>>>> This brings me to a next set of points:
>>>>>> 
>>>>>> * parquet-cpp contains quite a bit of code that was extracted from
>>>>>> Impala. This is mostly self-contained in
>>>>>> https://github.com/apache/parquet-cpp/tree/master/src/parquet/util
>>>>>> 
>>>>>> * My understanding is that Kudu extracted certain computational
>>>>>> utilities from Impala in its early days, but these tools have likely
>>>>>> diverged as the needs of the projects have evolved.
>>>>>> 
>>>>>> Since all of these projects are quite different in their end goals
>>>>>> (runtime systems vs. libraries), touching code that is tightly coupled
>>>>>> to either Kudu or Impala's runtimes is probably not worth discussing.
>>>>>> However, I think there is a strong basis for collaboration on
>>>>>> computational utilities and vectorized array processing. Some obvious
>>>>>> areas that come to mind:
>>>>>> 
>>>>>> * SIMD utilities (for hashing or processing of preallocated contiguous
>>>>>> memory)
>>>>>> * Array encoding utilities: RLE / Dictionary, etc.
>>>>>> * Bit manipulation (packing and unpacking, e.g. Daniel Lemire
>>>>>> contributed a patch to parquet-cpp around this)
>>>>>> * Date and time utilities
>>>>>> * Compression utilities
>>>>>> 
>>>>> 
>>>>> Between Kudu and Impala (at least) there are many more opportunities
>>> for
>>>>> sharing. Threads, logging, metrics, concurrent primitives - the list is
>>>>> quite long.
>>>>> 
>>>>> 
>>>>>> 
>>>>>> I hope the benefits are obvious: consolidating efforts on unit
>>>>>> testing, benchmarking, performance optimizations, continuous
>>>>>> integration, and platform compatibility.
>>>>>> 
>>>>>> Logistically speaking, one possible avenue might be to use Apache
>>>>>> Arrow as the place to assemble this code. Its thirdparty toolchain is
>>>>>> small, and it builds and installs fast. It is intended as a library to
>>>>>> have its headers used and linked against other applications. (As an
>>>>>> aside, I'm very interested in building optional support for Arrow
>>>>>> columnar messages into the kudu client).
>>>>>> 
>>>>> 
>>>>> In principle I'm in favour of code sharing, and it seems very much in
>>>>> keeping with the Apache way. However, practically speaking I'm of the
>>>>> opinion that it only makes sense to house shared support code in a
>>>>> separate, dedicated project.
>>>>> 
>>>>> Embedding the shared libraries in, e.g., Arrow naturally limits the
>>> scope
>>>>> of sharing to utilities that Arrow is interested in. It would make no
>>>> sense
>>>>> to add a threading library to Arrow if it was never used natively.
>>>> Muddying
>>>>> the waters of the project's charter seems likely to lead to user, and
>>>>> developer, confusion. Similarly, we should not necessarily couple
>>> Arrow's
>>>>> design goals to those it inherits from Kudu and Impala's source code.
>>>>> 
>>>>> I think I'd rather see a new Apache project than re-use a current one
>>> for
>>>>> two independent purposes.
>>>>> 
>>>>> 
>>>>>> 
>>>>>> The downsides of code sharing, which may have prevented it so far, are
>>>>>> the logistics of coordinating ASF release cycles and keeping build
>>>>>> toolchains in sync. It's taken us the past year to stabilize the
>>>>>> design of Arrow for its intended use cases, so at this point if we
>>>>>> went down this road I would be OK with helping the community commit to
>>>>>> a regular release cadence that would be faster than Impala, Kudu, and
>>>>>> Parquet's respective release cadences. Since members of the Kudu and
>>>>>> Impala PMC are also on the Arrow PMC, I trust we would be able to
>>>>>> collaborate to each other's mutual benefit and success.
>>>>>> 
>>>>>> Note that Arrow does not throw C++ exceptions and similarly follows the
>>>>>> Google C++ style guide to the same extent as Kudu and Impala.
>>>>>> 
>>>>>> If this is something that either the Kudu or Impala communities would
>>>>>> like to pursue in earnest, I would be happy to work with you on next
>>>>>> steps. I would suggest that we start with something small so that we
>>>>>> could address the necessary build toolchain changes, and develop a
>>>>>> workflow for moving around code and tests, a protocol for code reviews
>>>>>> (e.g. Gerrit), and coordinating ASF releases.
>>>>>> 
>>>>> 
>>>>> I think, if I'm reading this correctly, that you're assuming
>>> integration
>>>>> with the 'downstream' projects (e.g. Impala and Kudu) would be done via
>>>>> their toolchains. For something as fast moving as utility code - and
>>>>> critical, where you want the latency between adding a fix and including
>>>> it
>>>>> in your build to be ~0 - that's a non-starter to me, at least with how
>>>> the
>>>>> toolchains are currently realised.
>>>>> 
>>>>> I'd rather have the source code directly imported into Impala's tree -
>>>>> whether by git submodule or other mechanism. That way the coupling is
>>>>> looser, and we can move more quickly. I think that's important to other
>>>>> projects as well.
>>>>> 
>>>>> Henry
>>>>> 
>>>>> 
>>>>> 
>>>>>> 
>>>>>> Let me know what you think.
>>>>>> 
>>>>>> best
>>>>>> Wes
>>>>>> 
>>>> 
>>> 
>>> 
>>> 
>>> --
>>> Todd Lipcon
>>> Software Engineer, Cloudera
>>> 
>> --
>> --
>> Cheers,
>> Leif


Re: [DISCUSS] C++ code sharing amongst Apache {Arrow, Kudu, Impala, Parquet}

Posted by Julian Hyde <jh...@apache.org>.
“Commons” projects are often problematic. It is difficult to tell what is in scope and out of scope. If the scope is drawn too wide, there is a real problem of orphaned features, because people contribute one feature and then disappear.

Let’s remember the Apache mantra: community over code. If you create a sustainable community, the code will get looked after. Would this project form a new community, or just a new piece of code? As I read the current proposal, it would be the intersection of some existing communities, not a new community.

I think it would take a considerable effort to create a new project and community around the idea of “c++ commons” (or is it “database-related c++ commons”?). I think you already have such a community, to a first approximation, in the Arrow project, because Kudu and Impala developers are already part of the Arrow community. There’s no reason why Arrow cannot contain new modules that have different release schedules than the rest of Arrow. As a TLP, releases are less burdensome, and can happen in a little over 3 days if the component is kept stable.

Lastly, the code is fungible. It can be marked “experimental” within Arrow and moved to another project, or into a new project, as it matures. The Apache license and the ASF CLA makes this very easy. We are doing something like this in Calcite: the Avatica sub-project [1] has a community that intersect’s with Calcite’s, is disconnected at a code level, and may over time evolve into a separate project. In the mean time, being part of an established project is helpful, because there are PMC members to vote.

Julian

[1] https://calcite.apache.org/avatica/ <https://calcite.apache.org/avatica/>

> On Feb 27, 2017, at 6:41 AM, Wes McKinney <we...@gmail.com> wrote:
> 
> Responding to Todd's e-mail:
> 
> 1) Open source release model
> 
> My expectation is that this library would release about once a month,
> with occasional faster releases for critical fixes.
> 
> 2) Governance/review model
> 
> Beyond having centralized code reviews, it's hard to predict how the
> governance would play out. I understand that OSS projects behave
> differently in their planning / design / review process, so work on a
> common need may require more of a negotiation than the prior
> "unilateral" process.
> 
> I think it says something for our communities that we would make a
> commitment in our collaboration on this to the success of the
> "consumer" projects. So if the Arrow or Parquet communities were
> contemplating a change that might impact Kudu, for example, it would
> be in our best interest to be careful and communicate proactively.
> 
> This all makes sense. From an Arrow and Parquet perspective, we do not
> add very much testing burden because our continuous integration suites
> do not take long to run.
> 
> 3) Pre-commit/test mechanics
> 
> One thing that would help would be community-maintained
> Dockerfiles/Docker images (or equivalent) to assist with validation
> and testing for developers.
> 
> I am happy to comply with a pre-commit testing protocol that works for
> the Kudu and Impala teams.
> 
> 4) Integration mechanics for breaking changes
> 
>> One option is that each "user" of the libraries manually "rolls" to new versions when they feel like it, but there's still now a case where a common change "pushes work onto" the consumers to update call sites, etc.
> 
> Breaking API changes will create extra work, because any automated
> testing that we create will not be able to validate the patch to the
> common library. Perhaps we can configure a manual way (in Jenkins,
> say) to test two patches together.
> 
> In the event that a community member has a patch containing an API
> break that impacts a project that they are not a contributor for,
> there should be some expectation to either work with the affected
> project on a coordinated patch or obtain their +1 to merge the patch
> even though it will may require a follow up patch if the roll-forward
> in the consumer project exposes bugs in the common library. There may
> be situations like:
> 
> * Kudu changes API in $COMMON that impacts Arrow
> * Arrow says +1, we will roll forward $COMMON later
> * Patch merged
> * Arrow rolls forward, discovers bug caused by patch in $COMMON
> * Arrow proposes patch to $COMMON
> * ...
> 
> This is the worst case scenario, of course, but I actually think it is
> good because it would indicate that the unit testing in $COMMON needs
> to be improved. Unit testing in the common library, therefore, would
> take on more of a "defensive" quality than currently.
> 
> In any case, I'm keen to move forward to coming up with a concrete
> plan if we can reach consensus on the particulars.
> 
> Thanks
> Wes
> 
> On Sun, Feb 26, 2017 at 10:18 PM, Leif Walsh <le...@gmail.com> wrote:
>> I also support the idea of creating an "apache commons modern c++" style
>> library, maybe tailored toward the needs of columnar data processing
>> tools.  I think APR is the wrong project but I think that *style* of
>> project is the right direction to aim.
>> 
>> I agree this adds test and release process complexity across products but I
>> think the benefits of a shared, well-tested library outweigh that, and
>> creating such test infrastructure will have long-term benefits as well.
>> 
>> I'd be happy to lend a hand wherever it's needed.
>> 
>> On Sun, Feb 26, 2017 at 4:03 PM Todd Lipcon <to...@cloudera.com> wrote:
>> 
>>> Hey folks,
>>> 
>>> As Henry mentioned, Impala is starting to share more code with Kudu (most
>>> notably our RPC system, but that pulls in a fair bit of utility code as
>>> well), so we've been chatting periodically offline about the best way to do
>>> this. Having more projects potentially interested in collaborating is
>>> definitely welcome, though I think does also increase the complexity of
>>> whatever solution we come up with.
>>> 
>>> I think the potential benefits of collaboration are fairly self-evident, so
>>> I'll focus on my concerns here, which somewhat echo Henry's.
>>> 
>>> 1) Open source release model
>>> 
>>> The ASF is very much against having projects which do not do releases. So,
>>> if we were to create some new ASF project to hold this code, we'd be
>>> expected to do frequent releases thereof. Wes volunteered above to lead
>>> frequent releases, but we actually need at least 3 PMC members to vote on
>>> each release, and given people can come and go, we'd probably need at least
>>> 5-8 people who are actively committed to helping with the release process
>>> of this "commons" project.
>>> 
>>> Unlike our existing projects, which seem to release every 2-3 months, if
>>> that, I think this one would have to release _much_ more frequently, if we
>>> expect downstream projects to depend on released versions rather than just
>>> pulling in some recent (or even trunk) git hash. Since the ASF requires the
>>> normal voting period and process for every release, I don't think we could
>>> do something like have "daily automatic releases", etc.
>>> 
>>> We could probably campaign the ASF membership to treat this project
>>> differently, either as (a) a repository of code that never releases, in
>>> which case the "downstream" projects are responsible for vetting IP, etc,
>>> as part of their own release processes, or (b) a project which does
>>> automatic releases voted upon by robots. I'm guessing that (a) is more
>>> palatable from an IP perspective, and also from the perspective of the
>>> downstream projects.
>>> 
>>> 
>>> 2) Governance/review model
>>> 
>>> The more projects there are sharing this common code, the more difficult it
>>> is to know whether a change would break something, or even whether a change
>>> is considered desirable for all of the projects. I don't want to get into
>>> some world where any change to a central library requires a multi-week
>>> proposal/design-doc/review across 3+ different groups of committers, all of
>>> whom may have different near-term priorities. On the other hand, it would
>>> be pretty frustrating if the week before we're trying to cut a Kudu release
>>> branch, someone in another community decides to make a potentially
>>> destabilizing change to the RPC library.
>>> 
>>> 
>>> 3) Pre-commit/test mechanics
>>> 
>>> Semi-related to the above: we currently feel pretty confident when we make
>>> a change to a central library like kudu/util/thread.cc that nothing broke
>>> because we run the full suite of Kudu tests. Of course the central
>>> libraries have some unit test coverage, but I wouldn't be confident with
>>> any sort of model where shared code can change without verification by a
>>> larger suite of tests.
>>> 
>>> On the other hand, I also don't want to move to a model where any change to
>>> shared code requires a 6+-hour precommit spanning several projects, each of
>>> which may have its own set of potentially-flaky pre-commit tests, etc. I
>>> can imagine that if an Arrow developer made some change to "thread.cc" and
>>> saw that TabletServerStressTest failed their precommit, they'd have no idea
>>> how to triage it, etc. That could be a strong disincentive to continued
>>> innovation in these areas of common code, which we'll need a good way to
>>> avoid.
>>> 
>>> I think some of the above could be ameliorated with really good
>>> infrastructure -- eg on a test failure, automatically re-run the failed
>>> test on both pre-patch and post-patch, do a t-test to check statistical
>>> significance in flakiness level, etc. But, that's a lot of infrastructure
>>> that doesn't currently exist.
>>> 
>>> 
>>> 4) Integration mechanics for breaking changes
>>> 
>>> Currently these common libraries are treated as components of monolithic
>>> projects. That means it's no extra overhead for us to make some kind of
>>> change which breaks an API in src/kudu/util/ and at the same time updates
>>> all call sites. The internal libraries have no semblance of API
>>> compatibility guarantees, etc, and adding one is not without cost.
>>> 
>>> Before sharing code, we should figure out how exactly we'll manage the
>>> cases where we want to make some change in a common library that breaks an
>>> API used by other projects, given there's no way to make an atomic commit
>>> across many repositories. One option is that each "user" of the libraries
>>> manually "rolls" to new versions when they feel like it, but there's still
>>> now a case where a common change "pushes work onto" the consumers to update
>>> call sites, etc.
>>> 
>>> Admittedly, the number of breaking API changes in these common libraries is
>>> relatively small, but it would still be good to understand how we would plan
>>> to manage them.
>>> 
>>> -Todd
>>> 
>>> On Sun, Feb 26, 2017 at 10:12 AM, Wes McKinney <we...@gmail.com>
>>> wrote:
>>> 
>>>> hi Henry,
>>>> 
>>>> Thank you for these comments.
>>>> 
>>>> I think having a kind of "Apache Commons for [Modern] C++" would be an
>>>> ideal (though perhaps initially more labor intensive) solution.
>>>> There's code in Arrow that I would move into this project if it
>>>> existed. I am happy to help make this happen if there is interest from
>>>> the Kudu and Impala communities. I am not sure logistically what would
>>>> be the most expedient way to establish the project, whether as an ASF
>>>> Incubator project or possibly as a new TLP that could be created by
>>>> spinning IP out of Apache Kudu.
>>>> 
>>>> I'm interested to hear the opinions of others, and possible next steps.
>>>> 
>>>> Thanks
>>>> Wes
>>>> 
>>>> On Sun, Feb 26, 2017 at 2:12 AM, Henry Robinson <he...@apache.org>
>>> wrote:
>>>>> Thanks for bringing this up, Wes.
>>>>> 
>>>>> On 25 February 2017 at 14:18, Wes McKinney <we...@gmail.com>
>>> wrote:
>>>>> 
>>>>>> Dear Apache Kudu and Apache Impala (incubating) communities,
>>>>>> 
>>>>>> (I'm not sure the best way to have a cross-list discussion, so I
>>>>>> apologize if this does not work well)
>>>>>> 
>>>>>> On the recent Apache Parquet sync call, we discussed C++ code sharing
>>>>>> between the codebases in Apache Arrow and Apache Parquet, and
>>>>>> opportunities for more code sharing with Kudu and Impala as well.
>>>>>> 
>>>>>> As context:
>>>>>> 
>>>>>> * We have an RC out for the 1.0.0 release of apache-parquet-cpp, the
>>>>>> first C++ release within Apache Parquet. I got involved with this
>>>>>> project a little over a year ago and was faced with the unpleasant
>>>>>> decision to copy and paste a significant amount of code out of
>>>>>> Impala's codebase to bootstrap the project.
>>>>>> 
>>>>>> * In parallel, we began the Apache Arrow project, which is designed to
>>>>>> be a complementary library for file formats (like Parquet), storage
>>>>>> engines (like Kudu), and compute engines (like Impala and pandas).
>>>>>> 
>>>>>> * As Arrow and parquet-cpp matured, an increasing amount of code
>>>>>> overlap crept in around buffer memory management and IO
>>>>>> interfaces. We recently decided in PARQUET-818
>>>>>> (https://github.com/apache/parquet-cpp/commit/
>>>>>> 2154e873d5aa7280314189a2683fb1e12a590c02)
>>>>>> to remove some of the obvious code overlap in Parquet and make
>>>>>> libarrow.a/so a hard compile and link-time dependency for
>>>>>> libparquet.a/so.
>>>>>> 
>>>>>> * There is still quite a bit of code in parquet-cpp that would better
>>>>>> fit in Arrow: SIMD hash utilities, RLE encoding, dictionary encoding,
>>>>>> compression, bit utilities, and so forth. Much of this code originated
>>>>>> from Impala.
>>>>>> 
>>>>>> This brings me to my next set of points:
>>>>>> 
>>>>>> * parquet-cpp contains quite a bit of code that was extracted from
>>>>>> Impala. This is mostly self-contained in
>>>>>> https://github.com/apache/parquet-cpp/tree/master/src/parquet/util
>>>>>> 
>>>>>> * My understanding is that Kudu extracted certain computational
>>>>>> utilities from Impala in its early days, but these tools have likely
>>>>>> diverged as the needs of the projects have evolved.
>>>>>> 
>>>>>> Since all of these projects are quite different in their end goals
>>>>>> (runtime systems vs. libraries), touching code that is tightly coupled
>>>>>> to either Kudu or Impala's runtimes is probably not worth discussing.
>>>>>> However, I think there is a strong basis for collaboration on
>>>>>> computational utilities and vectorized array processing. Some obvious
>>>>>> areas that come to mind:
>>>>>> 
>>>>>> * SIMD utilities (for hashing or processing of preallocated contiguous
>>>>>> memory)
>>>>>> * Array encoding utilities: RLE / Dictionary, etc.
>>>>>> * Bit manipulation (packing and unpacking, e.g. Daniel Lemire
>>>>>> contributed a patch to parquet-cpp around this)
>>>>>> * Date and time utilities
>>>>>> * Compression utilities
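[Editorial note: to make the kind of self-contained utility under discussion concrete, a run-length encoder can be sketched in a few lines. This is a simplified illustration only — the actual implementations in parquet-cpp and Impala interleave bit-packed literal runs with repeated runs.]

```cpp
#include <cassert>
#include <cstdint>
#include <utility>
#include <vector>

// Minimal run-length encoding: collapse consecutive repeats into
// (value, run_length) pairs.
std::vector<std::pair<int32_t, uint32_t>> RleEncode(
    const std::vector<int32_t>& values) {
  std::vector<std::pair<int32_t, uint32_t>> runs;
  for (int32_t v : values) {
    if (!runs.empty() && runs.back().first == v) {
      ++runs.back().second;
    } else {
      runs.emplace_back(v, 1);
    }
  }
  return runs;
}

// Inverse transform: expand each (value, run_length) pair back out.
std::vector<int32_t> RleDecode(
    const std::vector<std::pair<int32_t, uint32_t>>& runs) {
  std::vector<int32_t> out;
  for (const auto& run : runs) {
    out.insert(out.end(), run.second, run.first);
  }
  return out;
}
```

Utilities with this shape — pure functions over contiguous buffers, no runtime coupling — are exactly the ones that copy cleanly between codebases, which is why they have historically been copy-pasted rather than shared.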
>>>>>> 
>>>>> 
>>>>> Between Kudu and Impala (at least) there are many more opportunities
>>> for
>>>>> sharing. Threads, logging, metrics, concurrent primitives - the list is
>>>>> quite long.
>>>>> 
>>>>> 
>>>>>> 
>>>>>> I hope the benefits are obvious: consolidating efforts on unit
>>>>>> testing, benchmarking, performance optimizations, continuous
>>>>>> integration, and platform compatibility.
>>>>>> 
>>>>>> Logistically speaking, one possible avenue might be to use Apache
>>>>>> Arrow as the place to assemble this code. Its thirdparty toolchain is
>>>>>> small, and it builds and installs fast. It is intended as a library whose
>>>>>> headers are included by, and which is linked against, other applications. (As an
>>>>>> aside, I'm very interested in building optional support for Arrow
>>>>>> columnar messages into the kudu client).
>>>>>> 
>>>>> 
>>>>> In principle I'm in favour of code sharing, and it seems very much in
>>>>> keeping with the Apache way. However, practically speaking I'm of the
>>>>> opinion that it only makes sense to house shared support code in a
>>>>> separate, dedicated project.
>>>>> 
>>>>> Embedding the shared libraries in, e.g., Arrow naturally limits the
>>> scope
>>>>> of sharing to utilities that Arrow is interested in. It would make no
>>>> sense
>>>>> to add a threading library to Arrow if it was never used natively.
>>>> Muddying
>>>>> the waters of the project's charter seems likely to lead to user, and
>>>>> developer, confusion. Similarly, we should not necessarily couple
>>> Arrow's
>>>>> design goals to those it inherits from Kudu and Impala's source code.
>>>>> 
>>>>> I think I'd rather see a new Apache project than re-use a current one
>>> for
>>>>> two independent purposes.
>>>>> 
>>>>> 
>>>>>> 
>>>>>> The downside of code sharing, which may have prevented it so far, is
>>>>>> the logistics of coordinating ASF release cycles and keeping build
>>>>>> toolchains in sync. It's taken us the past year to stabilize the
>>>>>> design of Arrow for its intended use cases, so at this point if we
>>>>>> went down this road I would be OK with helping the community commit to
>>>>>> a regular release cadence that would be faster than Impala, Kudu, and
>>>>>> Parquet's respective release cadences. Since members of the Kudu and
>>>>>> Impala PMC are also on the Arrow PMC, I trust we would be able to
>>>>>> collaborate to each other's mutual benefit and success.
>>>>>> 
>>>>>> Note that Arrow does not throw C++ exceptions and similarly follows
>>>>>> the Google C++ style guide to the same extent as Kudu and Impala.
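[Editorial note: the no-exceptions convention referred to here means fallible functions return a status object rather than throwing. The sketch below is a minimal stand-in for the pattern that arrow::Status and kudu::Status follow, not the real classes.]

```cpp
#include <cassert>
#include <string>
#include <utility>

// Minimal status type: empty message means success. Fallible
// functions return one of these instead of throwing an exception.
class Status {
 public:
  static Status OK() { return Status(""); }
  static Status Invalid(const std::string& msg) { return Status(msg); }
  bool ok() const { return msg_.empty(); }
  const std::string& message() const { return msg_; }

 private:
  explicit Status(std::string msg) : msg_(std::move(msg)) {}
  std::string msg_;
};

// Example fallible function in this style: output goes through an
// out-parameter, errors through the returned Status.
Status ParseNonNegative(const std::string& s, int* out) {
  if (s.empty()) return Status::Invalid("empty input");
  int value = 0;
  for (char c : s) {
    if (c < '0' || c > '9') return Status::Invalid("not a digit");
    value = value * 10 + (c - '0');
  }
  *out = value;
  return Status::OK();
}
```

Agreement on this calling convention across the projects is what makes the code-sharing idea plausible in the first place: a utility written this way drops into any of the codebases without an exception-safety audit.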
>>>>>> 
>>>>>> If this is something that either the Kudu or Impala communities would
>>>>>> like to pursue in earnest, I would be happy to work with you on next
>>>>>> steps. I would suggest that we start with something small so that we
>>>>>> could address the necessary build toolchain changes, and develop a
>>>>>> workflow for moving around code and tests, a protocol for code reviews
>>>>>> (e.g. Gerrit), and coordinating ASF releases.
>>>>>> 
>>>>> 
>>>>> I think, if I'm reading this correctly, that you're assuming
>>> integration
>>>>> with the 'downstream' projects (e.g. Impala and Kudu) would be done via
>>>>> their toolchains. For something as fast moving as utility code - and
>>>>> critical, where you want the latency between adding a fix and including
>>>> it
>>>>> in your build to be ~0 - that's a non-starter to me, at least with how
>>>> the
>>>>> toolchains are currently realised.
>>>>> 
>>>>> I'd rather have the source code directly imported into Impala's tree -
>>>>> whether by git submodule or other mechanism. That way the coupling is
>>>>> looser, and we can move more quickly. I think that's important to other
>>>>> projects as well.
>>>>> 
>>>>> Henry
>>>>> 
>>>>> 
>>>>> 
>>>>>> 
>>>>>> Let me know what you think.
>>>>>> 
>>>>>> best
>>>>>> Wes
>>>>>> 
>>>> 
>>> 
>>> 
>>> 
>>> --
>>> Todd Lipcon
>>> Software Engineer, Cloudera
>>> 
>> --
>> --
>> Cheers,
>> Leif


Re: [DISCUSS] C++ code sharing amongst Apache {Arrow, Kudu, Impala, Parquet}

Posted by Julian Hyde <jh...@apache.org>.
“Commons” projects are often problematic. It is difficult to tell what is in scope and out of scope. If the scope is drawn too wide, there is a real problem of orphaned features, because people contribute one feature and then disappear.

Let’s remember the Apache mantra: community over code. If you create a sustainable community, the code will get looked after. Would this project form a new community, or just a new piece of code? As I read the current proposal, it would be the intersection of some existing communities, not a new community.

I think it would take a considerable effort to create a new project and community around the idea of “c++ commons” (or is it “database-related c++ commons”?). I think you already have such a community, to a first approximation, in the Arrow project, because Kudu and Impala developers are already part of the Arrow community. There’s no reason why Arrow cannot contain new modules that have different release schedules than the rest of Arrow. As a TLP, releases are less burdensome, and can happen in a little over 3 days if the component is kept stable.

Lastly, the code is fungible. It can be marked “experimental” within Arrow and moved to another project, or into a new project, as it matures. The Apache license and the ASF CLA make this very easy. We are doing something like this in Calcite: the Avatica sub-project [1] has a community that intersects with Calcite’s, is disconnected at a code level, and may over time evolve into a separate project. In the meantime, being part of an established project is helpful, because there are PMC members to vote.

Julian

[1] https://calcite.apache.org/avatica/

> On Feb 27, 2017, at 6:41 AM, Wes McKinney <we...@gmail.com> wrote:
> 
> [quoted thread snipped]


Re: [DISCUSS] C++ code sharing amongst Apache {Arrow, Kudu, Impala, Parquet}

Posted by Wes McKinney <we...@gmail.com>.
Responding to Todd's e-mail:

1) Open source release model

My expectation is that this library would release about once a month,
with occasional faster releases for critical fixes.

2) Governance/review model

Beyond having centralized code reviews, it's hard to predict how the
governance would play out. I understand that OSS projects behave
differently in their planning / design / review process, so work on a
common need may require more of a negotiation than the prior
"unilateral" process.

I think it says something for our communities that we would make a
commitment in our collaboration on this to the success of the
"consumer" projects. So if the Arrow or Parquet communities were
contemplating a change that might impact Kudu, for example, it would
be in our best interest to be careful and communicate proactively.

This all makes sense. From an Arrow and Parquet perspective, we do not
add very much testing burden because our continuous integration suites
do not take long to run.

3) Pre-commit/test mechanics

One thing that would help would be community-maintained
Dockerfiles/Docker images (or equivalent) to assist with validation
and testing for developers.

I am happy to comply with a pre-commit testing protocol that works for
the Kudu and Impala teams.

4) Integration mechanics for breaking changes

> One option is that each "user" of the libraries manually "rolls" to new versions when they feel like it, but there's still now a case where a common change "pushes work onto" the consumers to update call sites, etc.

Breaking API changes will create extra work, because any automated
testing that we create will not be able to validate the patch to the
common library. Perhaps we can configure a manual way (in Jenkins,
say) to test two patches together.

In the event that a community member has a patch containing an API
break that impacts a project that they are not a contributor for,
there should be some expectation to either work with the affected
project on a coordinated patch or obtain their +1 to merge the patch
> even though it may require a follow-up patch if the roll-forward
in the consumer project exposes bugs in the common library. There may
be situations like:

* Kudu changes API in $COMMON that impacts Arrow
* Arrow says +1, we will roll forward $COMMON later
* Patch merged
* Arrow rolls forward, discovers bug caused by patch in $COMMON
* Arrow proposes patch to $COMMON
* ...

This is the worst case scenario, of course, but I actually think it is
good because it would indicate that the unit testing in $COMMON needs
to be improved. Unit testing in the common library, therefore, would
take on more of a "defensive" quality than currently.

In any case, I'm keen to move forward to coming up with a concrete
plan if we can reach consensus on the particulars.

Thanks
Wes

On Sun, Feb 26, 2017 at 10:18 PM, Leif Walsh <le...@gmail.com> wrote:
> I also support the idea of creating an "apache commons modern c++" style
> library, maybe tailored toward the needs of columnar data processing
> tools.  I think APR is the wrong project but I think that *style* of
> project is the right direction to aim.
>
> I agree this adds test and release process complexity across products but I
> think the benefits of a shared, well-tested library outweigh that, and
> creating such test infrastructure will have long-term benefits as well.
>
> I'd be happy to lend a hand wherever it's needed.
>
> On Sun, Feb 26, 2017 at 4:03 PM Todd Lipcon <to...@cloudera.com> wrote:
>
>> Hey folks,
>>
>> As Henry mentioned, Impala is starting to share more code with Kudu (most
>> notably our RPC system, but that pulls in a fair bit of utility code as
>> well), so we've been chatting periodically offline about the best way to do
>> this. Having more projects potentially interested in collaborating is
>> definitely welcome, though I think it does also increase the complexity of
>> whatever solution we come up with.
>>
>> I think the potential benefits of collaboration are fairly self-evident, so
>> I'll focus on my concerns here, which somewhat echo Henry's.
>>
>> 1) Open source release model
>>
>> The ASF is very much against having projects which do not do releases. So,
>> if we were to create some new ASF project to hold this code, we'd be
>> expected to do frequent releases thereof. Wes volunteered above to lead
>> frequent releases, but we actually need at least 3 PMC members to vote on
>> each release, and given people can come and go, we'd probably need at least
>> 5-8 people who are actively committed to helping with the release process
>> of this "commons" project.
>>
>> Unlike our existing projects, which seem to release every 2-3 months, if
>> that, I think this one would have to release _much_ more frequently, if we
>> expect downstream projects to depend on released versions rather than just
>> pulling in some recent (or even trunk) git hash. Since the ASF requires the
>> normal voting period and process for every release, I don't think we could
>> do something like have "daily automatic releases", etc.
>>
>> We could probably campaign the ASF membership to treat this project
>> differently, either as (a) a repository of code that never releases, in
>> which case the "downstream" projects are responsible for vetting IP, etc,
>> as part of their own release processes, or (b) a project which does
>> automatic releases voted upon by robots. I'm guessing that (a) is more
>> palatable from an IP perspective, and also from the perspective of the
>> downstream projects.
>>
>>
>> 2) Governance/review model
>>
>> The more projects there are sharing this common code, the more difficult it
>> is to know whether a change would break something, or even whether a change
>> is considered desirable for all of the projects. I don't want to get into
>> some world where any change to a central library requires a multi-week
>> proposal/design-doc/review across 3+ different groups of committers, all of
>> whom may have different near-term priorities. On the other hand, it would
>> be pretty frustrating if the week before we're trying to cut a Kudu release
>> branch, someone in another community decides to make a potentially
>> destabilizing change to the RPC library.
>>
>>
>> 3) Pre-commit/test mechanics
>>
>> Semi-related to the above: we currently feel pretty confident when we make
>> a change to a central library like kudu/util/thread.cc that nothing broke
>> because we run the full suite of Kudu tests. Of course the central
>> libraries have some unit test coverage, but I wouldn't be confident with
>> any sort of model where shared code can change without verification by a
>> larger suite of tests.
>>
>> On the other hand, I also don't want to move to a model where any change to
>> shared code requires a 6+-hour precommit spanning several projects, each of
>> which may have its own set of potentially-flaky pre-commit tests, etc. I
>> can imagine that if an Arrow developer made some change to "thread.cc" and
>> saw that TabletServerStressTest failed their precommit, they'd have no idea
>> how to triage it, etc. That could be a strong disincentive to continued
>> innovation in these areas of common code, which we'll need a good way to
>> avoid.
>>
>> I think some of the above could be ameliorated with really good
>> infrastructure -- eg on a test failure, automatically re-run the failed
>> test on both pre-patch and post-patch, do a t-test to check statistical
>> significance in flakiness level, etc. But, that's a lot of infrastructure
>> that doesn't currently exist.
>>
>>
>> 4) Integration mechanics for breaking changes
>>
>> Currently these common libraries are treated as components of monolithic
>> projects. That means it's no extra overhead for us to make some kind of
>> change which breaks an API in src/kudu/util/ and at the same time updates
>> all call sites. The internal libraries have no semblance of API
>> compatibility guarantees, etc, and adding one is not without cost.
>>
>> Before sharing code, we should figure out how exactly we'll manage the
>> cases where we want to make some change in a common library that breaks an
>> API used by other projects, given there's no way to make an atomic commit
>> across many repositories. One option is that each "user" of the libraries
>> manually "rolls" to new versions when they feel like it, but there's still
>> now a case where a common change "pushes work onto" the consumers to update
>> call sites, etc.
>>
>> Admittedly, the number of breaking API changes in these common libraries is
>> relatively small, but would still be good to understand how we would plan
>> to manage them.
>>
>> -Todd
>>
>> On Sun, Feb 26, 2017 at 10:12 AM, Wes McKinney <we...@gmail.com>
>> wrote:
>>
>> > hi Henry,
>> >
>> > Thank you for these comments.
>> >
>> > I think having a kind of "Apache Commons for [Modern] C++" would be an
>> > ideal (though perhaps initially more labor intensive) solution.
>> > There's code in Arrow that I would move into this project if it
>> > existed. I am happy to help make this happen if there is interest from
>> > the Kudu and Impala communities. I am not sure logistically what would
>> > be the most expedient way to establish the project, whether as an ASF
>> > Incubator project or possibly as a new TLP that could be created by
>> > spinning IP out of Apache Kudu.
>> >
>> > I'm interested to hear the opinions of others, and possible next steps.
>> >
>> > Thanks
>> > Wes
>> >
>> > On Sun, Feb 26, 2017 at 2:12 AM, Henry Robinson <he...@apache.org>
>> wrote:
>> > > Thanks for bringing this up, Wes.
>> > >
>> > > On 25 February 2017 at 14:18, Wes McKinney <we...@gmail.com>
>> wrote:
>> > >
>> > >> Dear Apache Kudu and Apache Impala (incubating) communities,
>> > >>
>> > >> (I'm not sure the best way to have a cross-list discussion, so I
>> > >> apologize if this does not work well)
>> > >>
>> > >> On the recent Apache Parquet sync call, we discussed C++ code sharing
>> > >> between the codebases in Apache Arrow and Apache Parquet, and
>> > >> opportunities for more code sharing with Kudu and Impala as well.
>> > >>
>> > >> As context
>> > >>
>> > >> * We have an RC out for the 1.0.0 release of apache-parquet-cpp, the
>> > >> first C++ release within Apache Parquet. I got involved with this
>> > >> project a little over a year ago and was faced with the unpleasant
>> > >> decision to copy and paste a significant amount of code out of
>> > >> Impala's codebase to bootstrap the project.
>> > >>
>> > >> * In parallel, we began the Apache Arrow project, which is designed to
>> > >> be a complementary library for file formats (like Parquet), storage
>> > >> engines (like Kudu), and compute engines (like Impala and pandas).
>> > >>
>> > >> * As Arrow and parquet-cpp matured, an increasing amount of code
>> > >> overlap crept up surrounding buffer memory management and IO
>> > >> interface. We recently decided in PARQUET-818
>> > >> (https://github.com/apache/parquet-cpp/commit/2154e873d5aa7280314189a2683fb1e12a590c02)
>> > >> to remove some of the obvious code overlap in Parquet and make
>> > >> libarrow.a/so a hard compile and link-time dependency for
>> > >> libparquet.a/so.
>> > >>
>> > >> * There is still quite a bit of code in parquet-cpp that would better
>> > >> fit in Arrow: SIMD hash utilities, RLE encoding, dictionary encoding,
>> > >> compression, bit utilities, and so forth. Much of this code originated
>> > >> from Impala
>> > >>
>> > >> This brings me to a next set of points:
>> > >>
>> > >> * parquet-cpp contains quite a bit of code that was extracted from
>> > >> Impala. This is mostly self-contained in
>> > >> https://github.com/apache/parquet-cpp/tree/master/src/parquet/util
>> > >>
>> > >> * My understanding is that Kudu extracted certain computational
>> > >> utilities from Impala in its early days, but these tools have likely
>> > >> diverged as the needs of the projects have evolved.
>> > >>
>> > >> Since all of these projects are quite different in their end goals
>> > >> (runtime systems vs. libraries), touching code that is tightly coupled
>> > >> to either Kudu or Impala's runtimes is probably not worth discussing.
>> > >> However, I think there is a strong basis for collaboration on
>> > >> computational utilities and vectorized array processing. Some obvious
>> > >> areas that come to mind:
>> > >>
>> > >> * SIMD utilities (for hashing or processing of preallocated contiguous
>> > >> memory)
>> > >> * Array encoding utilities: RLE / Dictionary, etc.
>> > >> * Bit manipulation (packing and unpacking, e.g. Daniel Lemire
>> > >> contributed a patch to parquet-cpp around this)
>> > >> * Date and time utilities
>> > >> * Compression utilities
>> > >>
>> > >
>> > > Between Kudu and Impala (at least) there are many more opportunities
>> for
>> > > sharing. Threads, logging, metrics, concurrent primitives - the list is
>> > > quite long.
>> > >
>> > >
>> > >>
>> > >> I hope the benefits are obvious: consolidating efforts on unit
>> > >> testing, benchmarking, performance optimizations, continuous
>> > >> integration, and platform compatibility.
>> > >>
>> > >> Logistically speaking, one possible avenue might be to use Apache
>> > >> Arrow as the place to assemble this code. Its thirdparty toolchain is
>> > >> small, and it builds and installs fast. It is intended as a library to
>> > >> have its headers used and linked against other applications. (As an
>> > >> aside, I'm very interested in building optional support for Arrow
>> > >> columnar messages into the kudu client).
>> > >>
>> > >
>> > > In principle I'm in favour of code sharing, and it seems very much in
>> > > keeping with the Apache way. However, practically speaking I'm of the
>> > > opinion that it only makes sense to house shared support code in a
>> > > separate, dedicated project.
>> > >
>> > > Embedding the shared libraries in, e.g., Arrow naturally limits the
>> scope
>> > > of sharing to utilities that Arrow is interested in. It would make no
>> > sense
>> > > to add a threading library to Arrow if it was never used natively.
>> > Muddying
>> > > the waters of the project's charter seems likely to lead to user, and
>> > > developer, confusion. Similarly, we should not necessarily couple
>> Arrow's
>> > > design goals to those it inherits from Kudu and Impala's source code.
>> > >
>> > > I think I'd rather see a new Apache project than re-use a current one
>> for
>> > > two independent purposes.
>> > >
>> > >
>> > >>
>> > >> The downsides of code sharing, which may have prevented it so far, are
>> > >> the logistics of coordinating ASF release cycles and keeping build
>> > >> toolchains in sync. It's taken us the past year to stabilize the
>> > >> design of Arrow for its intended use cases, so at this point if we
>> > >> went down this road I would be OK with helping the community commit to
>> > >> a regular release cadence that would be faster than Impala, Kudu, and
>> > >> Parquet's respective release cadences. Since members of the Kudu and
>> > >> Impala PMC are also on the Arrow PMC, I trust we would be able to
>> > >> collaborate to each other's mutual benefit and success.
>> > >>
>> > >> Note that Arrow does not throw C++ exceptions and similarly follows
>> > >> the Google C++ style guide to the same extent as Kudu and Impala.
>> > >>
>> > >> If this is something that either the Kudu or Impala communities would
>> > >> like to pursue in earnest, I would be happy to work with you on next
>> > >> steps. I would suggest that we start with something small so that we
>> > >> could address the necessary build toolchain changes, and develop a
>> > >> workflow for moving around code and tests, a protocol for code reviews
>> > >> (e.g. Gerrit), and coordinating ASF releases.
>> > >>
>> > >
>> > > I think, if I'm reading this correctly, that you're assuming
>> integration
>> > > with the 'downstream' projects (e.g. Impala and Kudu) would be done via
>> > > their toolchains. For something as fast moving as utility code - and
>> > > critical, where you want the latency between adding a fix and including
>> > it
>> > > in your build to be ~0 - that's a non-starter to me, at least with how
>> > the
>> > > toolchains are currently realised.
>> > >
>> > > I'd rather have the source code directly imported into Impala's tree -
>> > > whether by git submodule or other mechanism. That way the coupling is
>> > > looser, and we can move more quickly. I think that's important to other
>> > > projects as well.
>> > >
>> > > Henry
>> > >
>> > >
>> > >
>> > >>
>> > >> Let me know what you think.
>> > >>
>> > >> best
>> > >> Wes
>> > >>
>> >
>>
>>
>>
>> --
>> Todd Lipcon
>> Software Engineer, Cloudera
>>
> --
> Cheers,
> Leif

Re: [DISCUSS] C++ code sharing amongst Apache {Arrow, Kudu, Impala, Parquet}

Posted by Wes McKinney <we...@gmail.com>.
Responding to Todd's e-mail:

1) Open source release model

My expectation is that this library would release about once a month,
with occasional faster releases for critical fixes.

2) Governance/review model

Beyond having centralized code reviews, it's hard to predict how the
governance would play out. I understand that OSS projects behave
differently in their planning / design / review process, so work on a
common need may require more of a negotiation than the prior
"unilateral" process.

I think it says something for our communities that we would make a
commitment in our collaboration on this to the success of the
"consumer" projects. So if the Arrow or Parquet communities were
contemplating a change that might impact Kudu, for example, it would
be in our best interest to be careful and communicate proactively.

This all makes sense. From an Arrow and Parquet perspective, we do not
add very much testing burden because our continuous integration suites
do not take long to run.

3) Pre-commit/test mechanics

One thing that would help would be community-maintained
Dockerfiles/Docker images (or equivalent) to assist with validation
and testing for developers.

I am happy to comply with a pre-commit testing protocol that works for
the Kudu and Impala teams.

4) Integration mechanics for breaking changes

> One option is that each "user" of the libraries manually "rolls" to new versions when they feel like it, but there's still now a case where a common change "pushes work onto" the consumers to update call sites, etc.

Breaking API changes will create extra work, because any automated
testing that we create will not be able to validate the patch to the
common library. Perhaps we can configure a manual way (in Jenkins,
say) to test two patches together.

In the event that a community member has a patch containing an API
break that impacts a project that they are not a contributor for,
there should be some expectation to either work with the affected
project on a coordinated patch or obtain their +1 to merge the patch
even though it will may require a follow up patch if the roll-forward
in the consumer project exposes bugs in the common library. There may
be situations like:

* Kudu changes API in $COMMON that impacts Arrow
* Arrow says +1, we will roll forward $COMMON later
* Patch merged
* Arrow rolls forward, discovers bug caused by patch in $COMMON
* Arrow proposes patch to $COMMON
* ...

This is the worst case scenario, of course, but I actually think it is
good because it would indicate that the unit testing in $COMMON needs
to be improved. Unit testing in the common library, therefore, would
take on more of a "defensive" quality than currently.

In any case, I'm keen to move forward to coming up with a concrete
plan if we can reach consensus on the particulars.

Thanks
Wes

On Sun, Feb 26, 2017 at 10:18 PM, Leif Walsh <le...@gmail.com> wrote:
> I also support the idea of creating an "apache commons modern c++" style
> library, maybe tailored toward the needs of columnar data processing
> tools.  I think APR is the wrong project but I think that *style* of
> project is the right direction to aim.
>
> I agree this adds test and release process complexity across products but I
> think the benefits of a shared, well-tested library outweigh that, and
> creating such test infrastructure will have long-term benefits as well.
>
> I'd be happy to lend a hand wherever it's needed.
>
> On Sun, Feb 26, 2017 at 4:03 PM Todd Lipcon <to...@cloudera.com> wrote:
>
>> Hey folks,
>>
>> As Henry mentioned, Impala is starting to share more code with Kudu (most
>> notably our RPC system, but that pulls in a fair bit of utility code as
>> well), so we've been chatting periodically offline about the best way to do
>> this. Having more projects potentially interested in collaborating is
>> definitely welcome, though I think does also increase the complexity of
>> whatever solution we come up with.
>>
>> I think the potential benefits of collaboration are fairly self-evident, so
>> I'll focus on my concerns here, which somewhat echo Henry's.
>>
>> 1) Open source release model
>>
>> The ASF is very much against having projects which do not do releases. So,
>> if we were to create some new ASF project to hold this code, we'd be
>> expected to do frequent releases thereof. Wes volunteered above to lead
>> frequent releases, but we actually need at least 3 PMC members to vote on
>> each release, and given people can come and go, we'd probably need at least
>> 5-8 people who are actively committed to helping with the release process
>> of this "commons" project.
>>
>> Unlike our existing projects, which seem to release every 2-3 months, if
>> that, I think this one would have to release _much_ more frequently, if we
>> expect downstream projects to depend on released versions rather than just
>> pulling in some recent (or even trunk) git hash. Since the ASF requires the
>> normal voting period and process for every release, I don't think we could
>> do something like have "daily automatic releases", etc.
>>
>> We could probably campaign the ASF membership to treat this project
>> differently, either as (a) a repository of code that never releases, in
>> which case the "downstream" projects are responsible for vetting IP, etc,
>> as part of their own release processes, or (b) a project which does
>> automatic releases voted upon by robots. I'm guessing that (a) is more
>> palatable from an IP perspective, and also from the perspective of the
>> downstream projects.
>>
>>
>> 2) Governance/review model
>>
>> The more projects there are sharing this common code, the more difficult it
>> is to know whether a change would break something, or even whether a change
>> is considered desirable for all of the projects. I don't want to get into
>> some world where any change to a central library requires a multi-week
>> proposal/design-doc/review across 3+ different groups of committers, all of
>> whom may have different near-term priorities. On the other hand, it would
>> be pretty frustrating if the week before we're trying to cut a Kudu release
>> branch, someone in another community decides to make a potentially
>> destabilizing change to the RPC library.
>>
>>
>> 3) Pre-commit/test mechanics
>>
>> Semi-related to the above: we currently feel pretty confident when we make
>> a change to a central library like kudu/util/thread.cc that nothing broke
>> because we run the full suite of Kudu tests. Of course the central
>> libraries have some unit test coverage, but I wouldn't be confident with
>> any sort of model where shared code can change without verification by a
>> larger suite of tests.
>>
>> On the other hand, I also don't want to move to a model where any change to
>> shared code requires a 6+-hour precommit spanning several projects, each of
>> which may have its own set of potentially-flaky pre-commit tests, etc. I
>> can imagine that if an Arrow developer made some change to "thread.cc" and
>> saw that TabletServerStressTest failed their precommit, they'd have no idea
>> how to triage it, etc. That could be a strong disincentive to continued
>> innovation in these areas of common code, which we'll need a good way to
>> avoid.
>>
>> I think some of the above could be ameliorated with really good
>> infrastructure -- eg on a test failure, automatically re-run the failed
>> test on both pre-patch and post-patch, do a t-test to check statistical
>> significance in flakiness level, etc. But, that's a lot of infrastructure
>> that doesn't currently exist.
>>
>>
>> 4) Integration mechanics for breaking changes
>>
>> Currently these common libraries are treated as components of monolithic
>> projects. That means it's no extra overhead for us to make some kind of
>> change which breaks an API in src/kudu/util/ and at the same time updates
>> all call sites. The internal libraries have no semblance of API
>> compatibility guarantees, etc, and adding one is not without cost.
>>
>> Before sharing code, we should figure out how exactly we'll manage the
>> cases where we want to make some change in a common library that breaks an
>> API used by other projects, given there's no way to make an atomic commit
>> across many repositories. One option is that each "user" of the libraries
>> manually "rolls" to new versions when they feel like it, but there's still
>> now a case where a common change "pushes work onto" the consumers to update
>> call sites, etc.
>>
>> Admittedly, the number of breaking API changes in these common libraries is
>> relatively small, but would still be good to understand how we would plan
>> to manage them.
>>
>> -Todd
>>
>> On Sun, Feb 26, 2017 at 10:12 AM, Wes McKinney <we...@gmail.com>
>> wrote:
>>
>> > hi Henry,
>> >
>> > Thank you for these comments.
>> >
>> > I think having a kind of "Apache Commons for [Modern] C++" would be an
>> > ideal (though perhaps initially more labor intensive) solution.
>> > There's code in Arrow that I would move into this project if it
>> > existed. I am happy to help make this happen if there is interest from
>> > the Kudu and Impala communities. I am not sure logistically what would
>> > be the most expedient way to establish the project, whether as an ASF
>> > Incubator project or possibly as a new TLP that could be created by
>> > spinning IP out of Apache Kudu.
>> >
>> > I'm interested to hear the opinions of others, and possible next steps.
>> >
>> > Thanks
>> > Wes
>> >
>> > On Sun, Feb 26, 2017 at 2:12 AM, Henry Robinson <he...@apache.org>
>> wrote:
>> > > Thanks for bringing this up, Wes.
>> > >
>> > > On 25 February 2017 at 14:18, Wes McKinney <we...@gmail.com>
>> wrote:
>> > >
>> > >> Dear Apache Kudu and Apache Impala (incubating) communities,
>> > >>
>> > >> (I'm not sure the best way to have a cross-list discussion, so I
>> > >> apologize if this does not work well)
>> > >>
>> > >> On the recent Apache Parquet sync call, we discussed C++ code sharing
>> > >> between the codebases in Apache Arrow and Apache Parquet, and
>> > >> opportunities for more code sharing with Kudu and Impala as well.
>> > >>
>> > >> As context
>> > >>
>> > >> * We have an RC out for the 1.0.0 release of apache-parquet-cpp, the
>> > >> first C++ release within Apache Parquet. I got involved with this
>> > >> project a little over a year ago and was faced with the unpleasant
>> > >> decision to copy and paste a significant amount of code out of
>> > >> Impala's codebase to bootstrap the project.
>> > >>
>> > >> * In parallel, we begin the Apache Arrow project, which is designed to
>> > >> be a complementary library for file formats (like Parquet), storage
>> > >> engines (like Kudu), and compute engines (like Impala and pandas).
>> > >>
>> > >> * As Arrow and parquet-cpp matured, an increasing amount of code
>> > >> overlap crept up surrounding buffer memory management and IO
>> > >> interface. We recently decided in PARQUET-818
>> > >> (https://github.com/apache/parquet-cpp/commit/
>> > >> 2154e873d5aa7280314189a2683fb1e12a590c02)
>> > >> to remove some of the obvious code overlap in Parquet and make
>> > >> libarrow.a/so a hard compile and link-time dependency for
>> > >> libparquet.a/so.
>> > >>
>> > >> * There is still quite a bit of code in parquet-cpp that would better
>> > >> fit in Arrow: SIMD hash utilities, RLE encoding, dictionary encoding,
>> > >> compression, bit utilities, and so forth. Much of this code originated
>> > >> from Impala
>> > >>
>> > >> This brings me to a next set of points:
>> > >>
>> > >> * parquet-cpp contains quite a bit of code that was extracted from
>> > >> Impala. This is mostly self-contained in
>> > >> https://github.com/apache/parquet-cpp/tree/master/src/parquet/util
>> > >>
>> > >> * My understanding is that Kudu extracted certain computational
>> > >> utilities from Impala in its early days, but these tools have likely
>> > >> diverged as the needs of the projects have evolved.
>> > >>
>> > >> Since all of these projects are quite different in their end goals
>> > >> (runtime systems vs. libraries), touching code that is tightly coupled
>> > >> to either Kudu or Impala's runtimes is probably not worth discussing.
>> > >> However, I think there is a strong basis for collaboration on
>> > >> computational utilities and vectorized array processing. Some obvious
>> > >> areas that come to mind:
>> > >>
>> > >> * SIMD utilities (for hashing or processing of preallocated contiguous
>> > >> memory)
>> > >> * Array encoding utilities: RLE / Dictionary, etc.
>> > >> * Bit manipulation (packing and unpacking, e.g. Daniel Lemire
>> > >> contributed a patch to parquet-cpp around this)
>> > >> * Date and time utilities
>> > >> * Compression utilities
>> > >>
>> > >
>> > > Between Kudu and Impala (at least) there are many more opportunities
>> for
>> > > sharing. Threads, logging, metrics, concurrent primitives - the list is
>> > > quite long.
>> > >
>> > >
>> > >>
>> > >> I hope the benefits are obvious: consolidating efforts on unit
>> > >> testing, benchmarking, performance optimizations, continuous
>> > >> integration, and platform compatibility.
>> > >>
>> > >> Logistically speaking, one possible avenue might be to use Apache
>> > >> Arrow as the place to assemble this code. Its thirdparty toolchain is
>> > >> small, and it builds and installs fast. It is intended as a library to
>> > >> have its headers used and linked against other applications. (As an
>> > >> aside, I'm very interested in building optional support for Arrow
>> > >> columnar messages into the kudu client).
>> > >>
>> > >
>> > > In principle I'm in favour of code sharing, and it seems very much in
>> > > keeping with the Apache way. However, practically speaking I'm of the
>> > > opinion that it only makes sense to house shared support code in a
>> > > separate, dedicated project.
>> > >
>> > > Embedding the shared libraries in, e.g., Arrow naturally limits the
>> scope
>> > > of sharing to utilities that Arrow is interested in. It would make no
>> > sense
>> > > to add a threading library to Arrow if it was never used natively.
>> > Muddying
>> > > the waters of the project's charter seems likely to lead to user, and
>> > > developer, confusion. Similarly, we should not necessarily couple
>> Arrow's
>> > > design goals to those it inherits from Kudu and Impala's source code.
>> > >
>> > > I think I'd rather see a new Apache project than re-use a current one
>> for
>> > > two independent purposes.
>> > >
>> > >
>> > >>
>> > >> The downside of code sharing, which may have prevented it so far, are
>> > >> the logistics of coordinating ASF release cycles and keeping build
>> > >> toolchains in sync. It's taken us the past year to stabilize the
>> > >> design of Arrow for its intended use cases, so at this point if we
>> > >> went down this road I would be OK with helping the community commit to
>> > >> a regular release cadence that would be faster than Impala, Kudu, and
>> > >> Parquet's respective release cadences. Since members of the Kudu and
>> > >> Impala PMC are also on the Arrow PMC, I trust we would be able to
>> > >> collaborate to each other's mutual benefit and success.
>> > >>
>> > >> Note that Arrow does not throw C++ exceptions and similarly follows
>> > >> Google C++ style guide to the same extent at Kudu and Impala.
>> > >>
>> > >> If this is something that either the Kudu or Impala communities would
>> > >> like to pursue in earnest, I would be happy to work with you on next
>> > >> steps. I would suggest that we start with something small so that we
>> > >> could address the necessary build toolchain changes, and develop a
>> > >> workflow for moving around code and tests, a protocol for code reviews
>> > >> (e.g. Gerrit), and coordinating ASF releases.
>> > >>
>> > >
>> > > I think, if I'm reading this correctly, that you're assuming
>> integration
>> > > with the 'downstream' projects (e.g. Impala and Kudu) would be done via
>> > > their toolchains. For something as fast moving as utility code - and
>> > > critical, where you want the latency between adding a fix and including
>> > it
>> > > in your build to be ~0 - that's a non-starter to me, at least with how
>> > the
>> > > toolchains are currently realised.
>> > >
>> > > I'd rather have the source code directly imported into Impala's tree -
>> > > whether by git submodule or other mechanism. That way the coupling is
>> > > looser, and we can move more quickly. I think that's important to other
>> > > projects as well.
>> > >
>> > > Henry
>> > >
>> > >
>> > >
>> > >>
>> > >> Let me know what you think.
>> > >>
>> > >> best
>> > >> Wes
>> > >>
>> >
>>
>>
>>
>> --
>> Todd Lipcon
>> Software Engineer, Cloudera
>>
> --
> --
> Cheers,
> Leif

Re: [DISCUSS] C++ code sharing amongst Apache {Arrow, Kudu, Impala, Parquet}

Posted by Wes McKinney <we...@gmail.com>.
Responding to Todd's e-mail:

1) Open source release model

My expectation is that this library would release about once a month,
with occasional faster releases for critical fixes.

2) Governance/review model

Beyond having centralized code reviews, it's hard to predict how the
governance would play out. I understand that OSS projects behave
differently in their planning / design / review process, so work on a
common need may require more of a negotiation than the prior
"unilateral" process.

I think it says something for our communities that we would make a
commitment in our collaboration on this to the success of the
"consumer" projects. So if the Arrow or Parquet communities were
contemplating a change that might impact Kudu, for example, it would
be in our best interest to be careful and communicate proactively.

This all makes sense. From an Arrow and Parquet perspective, we do not
add very much testing burden because our continuous integration suites
do not take long to run.

3) Pre-commit/test mechanics

One thing that would help would be community-maintained
Dockerfiles/Docker images (or equivalent) to assist with validation
and testing for developers.

I am happy to comply with a pre-commit testing protocol that works for
the Kudu and Impala teams.

4) Integration mechanics for breaking changes

> One option is that each "user" of the libraries manually "rolls" to new versions when they feel like it, but there's still now a case where a common change "pushes work onto" the consumers to update call sites, etc.

Breaking API changes will create extra work, because any automated
testing that we create will not be able to validate the patch to the
common library. Perhaps we can configure a manual way (in Jenkins,
say) to test two patches together.

In the event that a community member has a patch containing an API
break that impacts a project that they are not a contributor for,
there should be some expectation to either work with the affected
project on a coordinated patch or obtain their +1 to merge the patch
even though it will may require a follow up patch if the roll-forward
in the consumer project exposes bugs in the common library. There may
be situations like:

* Kudu changes API in $COMMON that impacts Arrow
* Arrow says +1, we will roll forward $COMMON later
* Patch merged
* Arrow rolls forward, discovers bug caused by patch in $COMMON
* Arrow proposes patch to $COMMON
* ...

This is the worst case scenario, of course, but I actually think it is
good because it would indicate that the unit testing in $COMMON needs
to be improved. Unit testing in the common library, therefore, would
take on more of a "defensive" quality than currently.

In any case, I'm keen to move forward with a concrete plan if we can
reach consensus on the particulars.

Thanks
Wes

On Sun, Feb 26, 2017 at 10:18 PM, Leif Walsh <le...@gmail.com> wrote:
> I also support the idea of creating an "apache commons modern c++" style
> library, maybe tailored toward the needs of columnar data processing
> tools.  I think APR is the wrong project but I think that *style* of
> project is the right direction to aim.
>
> I agree this adds test and release process complexity across products but I
> think the benefits of a shared, well-tested library outweigh that, and
> creating such test infrastructure will have long-term benefits as well.
>
> I'd be happy to lend a hand wherever it's needed.
>
> On Sun, Feb 26, 2017 at 4:03 PM Todd Lipcon <to...@cloudera.com> wrote:
>
>> Hey folks,
>>
>> As Henry mentioned, Impala is starting to share more code with Kudu (most
>> notably our RPC system, but that pulls in a fair bit of utility code as
>> well), so we've been chatting periodically offline about the best way to do
>> this. Having more projects potentially interested in collaborating is
>> definitely welcome, though I think it does also increase the complexity of
>> whatever solution we come up with.
>>
>> I think the potential benefits of collaboration are fairly self-evident, so
>> I'll focus on my concerns here, which somewhat echo Henry's.
>>
>> 1) Open source release model
>>
>> The ASF is very much against having projects which do not do releases. So,
>> if we were to create some new ASF project to hold this code, we'd be
>> expected to do frequent releases thereof. Wes volunteered above to lead
>> frequent releases, but we actually need at least 3 PMC members to vote on
>> each release, and given people can come and go, we'd probably need at least
>> 5-8 people who are actively committed to helping with the release process
>> of this "commons" project.
>>
>> Unlike our existing projects, which seem to release every 2-3 months, if
>> that, I think this one would have to release _much_ more frequently, if we
>> expect downstream projects to depend on released versions rather than just
>> pulling in some recent (or even trunk) git hash. Since the ASF requires the
>> normal voting period and process for every release, I don't think we could
>> do something like have "daily automatic releases", etc.
>>
>> We could probably campaign the ASF membership to treat this project
>> differently, either as (a) a repository of code that never releases, in
>> which case the "downstream" projects are responsible for vetting IP, etc,
>> as part of their own release processes, or (b) a project which does
>> automatic releases voted upon by robots. I'm guessing that (a) is more
>> palatable from an IP perspective, and also from the perspective of the
>> downstream projects.
>>
>>
>> 2) Governance/review model
>>
>> The more projects there are sharing this common code, the more difficult it
>> is to know whether a change would break something, or even whether a change
>> is considered desirable for all of the projects. I don't want to get into
>> some world where any change to a central library requires a multi-week
>> proposal/design-doc/review across 3+ different groups of committers, all of
>> whom may have different near-term priorities. On the other hand, it would
>> be pretty frustrating if the week before we're trying to cut a Kudu release
>> branch, someone in another community decides to make a potentially
>> destabilizing change to the RPC library.
>>
>>
>> 3) Pre-commit/test mechanics
>>
>> Semi-related to the above: we currently feel pretty confident when we make
>> a change to a central library like kudu/util/thread.cc that nothing broke
>> because we run the full suite of Kudu tests. Of course the central
>> libraries have some unit test coverage, but I wouldn't be confident with
>> any sort of model where shared code can change without verification by a
>> larger suite of tests.
>>
>> On the other hand, I also don't want to move to a model where any change to
>> shared code requires a 6+-hour precommit spanning several projects, each of
>> which may have its own set of potentially-flaky pre-commit tests, etc. I
>> can imagine that if an Arrow developer made some change to "thread.cc" and
>> saw that TabletServerStressTest failed their precommit, they'd have no idea
>> how to triage it, etc. That could be a strong disincentive to continued
>> innovation in these areas of common code, which we'll need a good way to
>> avoid.
>>
>> I think some of the above could be ameliorated with really good
>> infrastructure -- eg on a test failure, automatically re-run the failed
>> test on both pre-patch and post-patch, do a t-test to check statistical
>> significance in flakiness level, etc. But, that's a lot of infrastructure
>> that doesn't currently exist.
>>
>>
>> 4) Integration mechanics for breaking changes
>>
>> Currently these common libraries are treated as components of monolithic
>> projects. That means it's no extra overhead for us to make some kind of
>> change which breaks an API in src/kudu/util/ and at the same time updates
>> all call sites. The internal libraries have no semblance of API
>> compatibility guarantees, etc, and adding one is not without cost.
>>
>> Before sharing code, we should figure out how exactly we'll manage the
>> cases where we want to make some change in a common library that breaks an
>> API used by other projects, given there's no way to make an atomic commit
>> across many repositories. One option is that each "user" of the libraries
>> manually "rolls" to new versions when they feel like it, but there's still
>> now a case where a common change "pushes work onto" the consumers to update
>> call sites, etc.
>>
>> Admittedly, the number of breaking API changes in these common libraries is
>> relatively small, but it would still be good to understand how we would plan
>> to manage them.
>>
>> -Todd
>>
>> On Sun, Feb 26, 2017 at 10:12 AM, Wes McKinney <we...@gmail.com>
>> wrote:
>>
>> > hi Henry,
>> >
>> > Thank you for these comments.
>> >
>> > I think having a kind of "Apache Commons for [Modern] C++" would be an
>> > ideal (though perhaps initially more labor intensive) solution.
>> > There's code in Arrow that I would move into this project if it
>> > existed. I am happy to help make this happen if there is interest from
>> > the Kudu and Impala communities. I am not sure logistically what would
>> > be the most expedient way to establish the project, whether as an ASF
>> > Incubator project or possibly as a new TLP that could be created by
>> > spinning IP out of Apache Kudu.
>> >
>> > I'm interested to hear the opinions of others, and possible next steps.
>> >
>> > Thanks
>> > Wes
>> >
>> > On Sun, Feb 26, 2017 at 2:12 AM, Henry Robinson <he...@apache.org>
>> wrote:
>> > > Thanks for bringing this up, Wes.
>> > >
>> > > On 25 February 2017 at 14:18, Wes McKinney <we...@gmail.com>
>> wrote:
>> > >
>> > >> Dear Apache Kudu and Apache Impala (incubating) communities,
>> > >>
>> > >> (I'm not sure the best way to have a cross-list discussion, so I
>> > >> apologize if this does not work well)
>> > >>
>> > >> On the recent Apache Parquet sync call, we discussed C++ code sharing
>> > >> between the codebases in Apache Arrow and Apache Parquet, and
>> > >> opportunities for more code sharing with Kudu and Impala as well.
>> > >>
>> > >> As context
>> > >>
>> > >> * We have an RC out for the 1.0.0 release of apache-parquet-cpp, the
>> > >> first C++ release within Apache Parquet. I got involved with this
>> > >> project a little over a year ago and was faced with the unpleasant
>> > >> decision to copy and paste a significant amount of code out of
>> > >> Impala's codebase to bootstrap the project.
>> > >>
>> > >> * In parallel, we began the Apache Arrow project, which is designed to
>> > >> be a complementary library for file formats (like Parquet), storage
>> > >> engines (like Kudu), and compute engines (like Impala and pandas).
>> > >>
>> > >> * As Arrow and parquet-cpp matured, an increasing amount of code
>> > >> overlap crept up surrounding buffer memory management and IO
>> > >> interface. We recently decided in PARQUET-818
>> > >> (https://github.com/apache/parquet-cpp/commit/2154e873d5aa7280314189a2683fb1e12a590c02)
>> > >> to remove some of the obvious code overlap in Parquet and make
>> > >> libarrow.a/so a hard compile and link-time dependency for
>> > >> libparquet.a/so.
>> > >>
>> > >> * There is still quite a bit of code in parquet-cpp that would better
>> > >> fit in Arrow: SIMD hash utilities, RLE encoding, dictionary encoding,
>> > >> compression, bit utilities, and so forth. Much of this code originated
>> > >> from Impala
>> > >>
>> > >> This brings me to a next set of points:
>> > >>
>> > >> * parquet-cpp contains quite a bit of code that was extracted from
>> > >> Impala. This is mostly self-contained in
>> > >> https://github.com/apache/parquet-cpp/tree/master/src/parquet/util
>> > >>
>> > >> * My understanding is that Kudu extracted certain computational
>> > >> utilities from Impala in its early days, but these tools have likely
>> > >> diverged as the needs of the projects have evolved.
>> > >>
>> > >> Since all of these projects are quite different in their end goals
>> > >> (runtime systems vs. libraries), touching code that is tightly coupled
>> > >> to either Kudu or Impala's runtimes is probably not worth discussing.
>> > >> However, I think there is a strong basis for collaboration on
>> > >> computational utilities and vectorized array processing. Some obvious
>> > >> areas that come to mind:
>> > >>
>> > >> * SIMD utilities (for hashing or processing of preallocated contiguous
>> > >> memory)
>> > >> * Array encoding utilities: RLE / Dictionary, etc.
>> > >> * Bit manipulation (packing and unpacking, e.g. Daniel Lemire
>> > >> contributed a patch to parquet-cpp around this)
>> > >> * Date and time utilities
>> > >> * Compression utilities
>> > >>
>> > >
>> > > Between Kudu and Impala (at least) there are many more opportunities for
>> > > sharing. Threads, logging, metrics, concurrent primitives - the list is
>> > > quite long.
>> > >
>> > >
>> > >>
>> > >> I hope the benefits are obvious: consolidating efforts on unit
>> > >> testing, benchmarking, performance optimizations, continuous
>> > >> integration, and platform compatibility.
>> > >>
>> > >> Logistically speaking, one possible avenue might be to use Apache
>> > >> Arrow as the place to assemble this code. Its thirdparty toolchain is
>> > >> small, and it builds and installs fast. It is intended as a library to
>> > >> have its headers used and linked against other applications. (As an
>> > >> aside, I'm very interested in building optional support for Arrow
>> > >> columnar messages into the kudu client).
>> > >>
>> > >
>> > > In principle I'm in favour of code sharing, and it seems very much in
>> > > keeping with the Apache way. However, practically speaking I'm of the
>> > > opinion that it only makes sense to house shared support code in a
>> > > separate, dedicated project.
>> > >
>> > > Embedding the shared libraries in, e.g., Arrow naturally limits the scope
>> > > of sharing to utilities that Arrow is interested in. It would make no sense
>> > > to add a threading library to Arrow if it was never used natively. Muddying
>> > > the waters of the project's charter seems likely to lead to user, and
>> > > developer, confusion. Similarly, we should not necessarily couple Arrow's
>> > > design goals to those it inherits from Kudu and Impala's source code.
>> > >
>> > > I think I'd rather see a new Apache project than re-use a current one for
>> > > two independent purposes.
>> > >
>> > >
>> > >>
>> > >> The downsides of code sharing, which may have prevented it so far, are
>> > >> the logistics of coordinating ASF release cycles and keeping build
>> > >> toolchains in sync. It's taken us the past year to stabilize the
>> > >> design of Arrow for its intended use cases, so at this point if we
>> > >> went down this road I would be OK with helping the community commit to
>> > >> a regular release cadence that would be faster than Impala, Kudu, and
>> > >> Parquet's respective release cadences. Since members of the Kudu and
>> > >> Impala PMC are also on the Arrow PMC, I trust we would be able to
>> > >> collaborate to each other's mutual benefit and success.
>> > >>
>> > >> Note that Arrow does not throw C++ exceptions and similarly follows
>> > >> the Google C++ style guide to the same extent as Kudu and Impala.
>> > >>
>> > >> If this is something that either the Kudu or Impala communities would
>> > >> like to pursue in earnest, I would be happy to work with you on next
>> > >> steps. I would suggest that we start with something small so that we
>> > >> could address the necessary build toolchain changes, and develop a
>> > >> workflow for moving around code and tests, a protocol for code reviews
>> > >> (e.g. Gerrit), and coordinating ASF releases.
>> > >>
>> > >
>> > > I think, if I'm reading this correctly, that you're assuming integration
>> > > with the 'downstream' projects (e.g. Impala and Kudu) would be done via
>> > > their toolchains. For something as fast moving as utility code - and
>> > > critical, where you want the latency between adding a fix and including it
>> > > in your build to be ~0 - that's a non-starter to me, at least with how the
>> > > toolchains are currently realised.
>> > >
>> > > I'd rather have the source code directly imported into Impala's tree -
>> > > whether by git submodule or other mechanism. That way the coupling is
>> > > looser, and we can move more quickly. I think that's important to other
>> > > projects as well.
>> > >
>> > > Henry
>> > >
>> > >
>> > >
>> > >>
>> > >> Let me know what you think.
>> > >>
>> > >> best
>> > >> Wes
>> > >>
>> >
>>
>>
>>
>> --
>> Todd Lipcon
>> Software Engineer, Cloudera
>>
> --
> --
> Cheers,
> Leif


> branch, someone in another community decides to make a potentially
> destabilizing change to the RPC library.
>
>
> 3) Pre-commit/test mechanics
>
> Semi-related to the above: we currently feel pretty confident when we make
> a change to a central library like kudu/util/thread.cc that nothing broke
> because we run the full suite of Kudu tests. Of course the central
> libraries have some unit test coverage, but I wouldn't be confident with
> any sort of model where shared code can change without verification by a
> larger suite of tests.
>
> On the other hand, I also don't want to move to a model where any change to
> shared code requires a 6+-hour precommit spanning several projects, each of
> which may have its own set of potentially-flaky pre-commit tests, etc. I
> can imagine that if an Arrow developer made some change to "thread.cc" and
> saw that TabletServerStressTest failed their precommit, they'd have no idea
> how to triage it, etc. That could be a strong disincentive to continued
> innovation in these areas of common code, which we'll need a good way to
> avoid.
>
> I think some of the above could be ameliorated with really good
> infrastructure -- eg on a test failure, automatically re-run the failed
> test on both pre-patch and post-patch, do a t-test to check statistical
> significance in flakiness level, etc. But, that's a lot of infrastructure
> that doesn't currently exist.
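To make the statistical check described above concrete, here is a minimal sketch (hypothetical infrastructure, not code that exists in any of these projects) of how a flakiness comparison could work: re-run the failed test N times at the pre-patch and post-patch revisions, record each outcome as 0 (pass) or 1 (fail), and compute Welch's t statistic on the two failure rates.

```cpp
#include <cmath>
#include <vector>

// Sketch of the "statistical flakiness check" idea: given pass/fail
// outcomes (1 = fail) from re-running a test before and after a patch,
// compute Welch's t statistic for the difference in failure rates.
// |t| above ~2 suggests the patch genuinely changed the flakiness level
// rather than tripping over pre-existing flakiness.
// Hypothetical helper; assumes at least two runs per side.
double FlakinessTStat(const std::vector<int>& pre_failures,
                      const std::vector<int>& post_failures) {
  auto mean_var = [](const std::vector<int>& x, double* mean, double* var) {
    double sum = 0;
    for (int v : x) sum += v;
    *mean = sum / x.size();
    double sq = 0;
    for (int v : x) sq += (v - *mean) * (v - *mean);
    *var = sq / (x.size() - 1);  // sample variance
  };
  double m1, v1, m2, v2;
  mean_var(pre_failures, &m1, &v1);
  mean_var(post_failures, &m2, &v2);
  return (m2 - m1) / std::sqrt(v1 / pre_failures.size() +
                               v2 / post_failures.size());
}
```

A real deployment would also need to pick N large enough for the flakiness rates involved, which is part of the infrastructure cost noted above.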
>
>
> 4) Integration mechanics for breaking changes
>
> Currently these common libraries are treated as components of monolithic
> projects. That means it's no extra overhead for us to make some kind of
> change which breaks an API in src/kudu/util/ and at the same time updates
> all call sites. The internal libraries have no semblance of API
> compatibility guarantees, etc, and adding one is not without cost.
>
> Before sharing code, we should figure out how exactly we'll manage the
> cases where we want to make some change in a common library that breaks an
> API used by other projects, given there's no way to make an atomic commit
> across many repositories. One option is that each "user" of the libraries
> manually "rolls" to new versions when they feel like it, but there's still
> now a case where a common change "pushes work onto" the consumers to update
> call sites, etc.
>
> Admittedly, the number of breaking API changes in these common libraries is
> relatively small, but it would still be good to understand how we would plan
> to manage them.
>
> -Todd
>
> On Sun, Feb 26, 2017 at 10:12 AM, Wes McKinney <we...@gmail.com>
> wrote:
>
> > hi Henry,
> >
> > Thank you for these comments.
> >
> > I think having a kind of "Apache Commons for [Modern] C++" would be an
> > ideal (though perhaps initially more labor intensive) solution.
> > There's code in Arrow that I would move into this project if it
> > existed. I am happy to help make this happen if there is interest from
> > the Kudu and Impala communities. I am not sure logistically what would
> > be the most expedient way to establish the project, whether as an ASF
> > Incubator project or possibly as a new TLP that could be created by
> > spinning IP out of Apache Kudu.
> >
> > I'm interested to hear the opinions of others, and possible next steps.
> >
> > Thanks
> > Wes
> >
> > On Sun, Feb 26, 2017 at 2:12 AM, Henry Robinson <he...@apache.org>
> wrote:
> > > Thanks for bringing this up, Wes.
> > >
> > > On 25 February 2017 at 14:18, Wes McKinney <we...@gmail.com>
> wrote:
> > >
> > >> Dear Apache Kudu and Apache Impala (incubating) communities,
> > >>
> > >> (I'm not sure the best way to have a cross-list discussion, so I
> > >> apologize if this does not work well)
> > >>
> > >> On the recent Apache Parquet sync call, we discussed C++ code sharing
> > >> between the codebases in Apache Arrow and Apache Parquet, and
> > >> opportunities for more code sharing with Kudu and Impala as well.
> > >>
> > >> As context
> > >>
> > >> * We have an RC out for the 1.0.0 release of apache-parquet-cpp, the
> > >> first C++ release within Apache Parquet. I got involved with this
> > >> project a little over a year ago and was faced with the unpleasant
> > >> decision to copy and paste a significant amount of code out of
> > >> Impala's codebase to bootstrap the project.
> > >>
> > >> * In parallel, we began the Apache Arrow project, which is designed to
> > >> be a complementary library for file formats (like Parquet), storage
> > >> engines (like Kudu), and compute engines (like Impala and pandas).
> > >>
> > >> * As Arrow and parquet-cpp matured, an increasing amount of code
> > >> overlap crept up surrounding buffer memory management and IO
> > >> interface. We recently decided in PARQUET-818
> > >> (https://github.com/apache/parquet-cpp/commit/
> > >> 2154e873d5aa7280314189a2683fb1e12a590c02)
> > >> to remove some of the obvious code overlap in Parquet and make
> > >> libarrow.a/so a hard compile and link-time dependency for
> > >> libparquet.a/so.
> > >>
> > >> * There is still quite a bit of code in parquet-cpp that would better
> > >> fit in Arrow: SIMD hash utilities, RLE encoding, dictionary encoding,
> > >> compression, bit utilities, and so forth. Much of this code originated
> > >> from Impala
> > >>
> > >> This brings me to a next set of points:
> > >>
> > >> * parquet-cpp contains quite a bit of code that was extracted from
> > >> Impala. This is mostly self-contained in
> > >> https://github.com/apache/parquet-cpp/tree/master/src/parquet/util
> > >>
> > >> * My understanding is that Kudu extracted certain computational
> > >> utilities from Impala in its early days, but these tools have likely
> > >> diverged as the needs of the projects have evolved.
> > >>
> > >> Since all of these projects are quite different in their end goals
> > >> (runtime systems vs. libraries), touching code that is tightly coupled
> > >> to either Kudu or Impala's runtimes is probably not worth discussing.
> > >> However, I think there is a strong basis for collaboration on
> > >> computational utilities and vectorized array processing. Some obvious
> > >> areas that come to mind:
> > >>
> > >> * SIMD utilities (for hashing or processing of preallocated contiguous
> > >> memory)
> > >> * Array encoding utilities: RLE / Dictionary, etc.
> > >> * Bit manipulation (packing and unpacking, e.g. Daniel Lemire
> > >> contributed a patch to parquet-cpp around this)
> > >> * Date and time utilities
> > >> * Compression utilities
> > >>
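For a sense of how self-contained the utilities in this list are, here is a toy run-length encoder; it is purely illustrative and much simpler than the bit-packed RLE/dictionary hybrid that parquet-cpp actually uses.

```cpp
#include <cstdint>
#include <utility>
#include <vector>

// Minimal run-length encoder: collapses consecutive equal values into
// (value, count) pairs. Illustrative only -- the real Impala-derived
// RLE utilities operate on bit-packed buffers and interleave literal
// and repeated runs.
std::vector<std::pair<int32_t, uint32_t>> RleEncode(
    const std::vector<int32_t>& values) {
  std::vector<std::pair<int32_t, uint32_t>> runs;
  for (int32_t v : values) {
    if (!runs.empty() && runs.back().first == v) {
      ++runs.back().second;
    } else {
      runs.emplace_back(v, 1);
    }
  }
  return runs;
}
```

Code like this has no dependency on a query engine or storage runtime, which is what makes it a plausible candidate for a shared library.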
> > >
> > > Between Kudu and Impala (at least) there are many more opportunities
> for
> > > sharing. Threads, logging, metrics, concurrent primitives - the list is
> > > quite long.
> > >
> > >
> > >>
> > >> I hope the benefits are obvious: consolidating efforts on unit
> > >> testing, benchmarking, performance optimizations, continuous
> > >> integration, and platform compatibility.
> > >>
> > >> Logistically speaking, one possible avenue might be to use Apache
> > >> Arrow as the place to assemble this code. Its thirdparty toolchain is
> > >> small, and it builds and installs fast. It is intended as a library to
> > >> have its headers used and linked against other applications. (As an
> > >> aside, I'm very interested in building optional support for Arrow
> > >> columnar messages into the kudu client).
> > >>
> > >
> > > In principle I'm in favour of code sharing, and it seems very much in
> > > keeping with the Apache way. However, practically speaking I'm of the
> > > opinion that it only makes sense to house shared support code in a
> > > separate, dedicated project.
> > >
> > > Embedding the shared libraries in, e.g., Arrow naturally limits the
> scope
> > > of sharing to utilities that Arrow is interested in. It would make no
> > sense
> > > to add a threading library to Arrow if it was never used natively.
> > Muddying
> > > the waters of the project's charter seems likely to lead to user, and
> > > developer, confusion. Similarly, we should not necessarily couple
> Arrow's
> > > design goals to those it inherits from Kudu and Impala's source code.
> > >
> > > I think I'd rather see a new Apache project than re-use a current one
> for
> > > two independent purposes.
> > >
> > >
> > >>
> > >> The downside of code sharing, which may have prevented it so far, is
> > >> the logistics of coordinating ASF release cycles and keeping build
> > >> toolchains in sync. It's taken us the past year to stabilize the
> > >> design of Arrow for its intended use cases, so at this point if we
> > >> went down this road I would be OK with helping the community commit to
> > >> a regular release cadence that would be faster than Impala, Kudu, and
> > >> Parquet's respective release cadences. Since members of the Kudu and
> > >> Impala PMC are also on the Arrow PMC, I trust we would be able to
> > >> collaborate to each other's mutual benefit and success.
> > >>
> > >> Note that Arrow does not throw C++ exceptions and follows the
> > >> Google C++ style guide to the same extent as Kudu and Impala do.
> > >>
> > >> If this is something that either the Kudu or Impala communities would
> > >> like to pursue in earnest, I would be happy to work with you on next
> > >> steps. I would suggest that we start with something small so that we
> > >> could address the necessary build toolchain changes, and develop a
> > >> workflow for moving around code and tests, a protocol for code reviews
> > >> (e.g. Gerrit), and coordinating ASF releases.
> > >>
> > >
> > > I think, if I'm reading this correctly, that you're assuming
> integration
> > > with the 'downstream' projects (e.g. Impala and Kudu) would be done via
> > > their toolchains. For something as fast moving as utility code - and
> > > critical, where you want the latency between adding a fix and including
> > it
> > > in your build to be ~0 - that's a non-starter to me, at least with how
> > the
> > > toolchains are currently realised.
> > >
> > > I'd rather have the source code directly imported into Impala's tree -
> > > whether by git submodule or other mechanism. That way the coupling is
> > > looser, and we can move more quickly. I think that's important to other
> > > projects as well.
> > >
> > > Henry
> > >
> > >
> > >
> > >>
> > >> Let me know what you think.
> > >>
> > >> best
> > >> Wes
> > >>
> >
>
>
>
> --
> Todd Lipcon
> Software Engineer, Cloudera
>
-- 
-- 
Cheers,
Leif

Re: [DISCUSS] C++ code sharing amongst Apache {Arrow, Kudu, Impala, Parquet}

Posted by Leif Walsh <le...@gmail.com>.
I also support the idea of creating an "apache commons modern c++" style
library, maybe tailored toward the needs of columnar data processing
tools.  I think APR is the wrong project but I think that *style* of
project is the right direction to aim.

I agree this adds test and release process complexity across products but I
think the benefits of a shared, well-tested library outweigh that, and
creating such test infrastructure will have long-term benefits as well.

I'd be happy to lend a hand wherever it's needed.

On Sun, Feb 26, 2017 at 4:03 PM Todd Lipcon <to...@cloudera.com> wrote:

> Hey folks,
>
> As Henry mentioned, Impala is starting to share more code with Kudu (most
> notably our RPC system, but that pulls in a fair bit of utility code as
> well), so we've been chatting periodically offline about the best way to do
> this. Having more projects potentially interested in collaborating is
> definitely welcome, though I think does also increase the complexity of
> whatever solution we come up with.
>
> I think the potential benefits of collaboration are fairly self-evident, so
> I'll focus on my concerns here, which somewhat echo Henry's.
>
> 1) Open source release model
>
> The ASF is very much against having projects which do not do releases. So,
> if we were to create some new ASF project to hold this code, we'd be
> expected to do frequent releases thereof. Wes volunteered above to lead
> frequent releases, but we actually need at least 3 PMC members to vote on
> each release, and given people can come and go, we'd probably need at least
> 5-8 people who are actively committed to helping with the release process
> of this "commons" project.
>
> Unlike our existing projects, which seem to release every 2-3 months, if
> that, I think this one would have to release _much_ more frequently, if we
> expect downstream projects to depend on released versions rather than just
> pulling in some recent (or even trunk) git hash. Since the ASF requires the
> normal voting period and process for every release, I don't think we could
> do something like have "daily automatic releases", etc.
>
> We could probably campaign the ASF membership to treat this project
> differently, either as (a) a repository of code that never releases, in
> which case the "downstream" projects are responsible for vetting IP, etc,
> as part of their own release processes, or (b) a project which does
> automatic releases voted upon by robots. I'm guessing that (a) is more
> palatable from an IP perspective, and also from the perspective of the
> downstream projects.
>
>
> 2) Governance/review model
>
> The more projects there are sharing this common code, the more difficult it
> is to know whether a change would break something, or even whether a change
> is considered desirable for all of the projects. I don't want to get into
> some world where any change to a central library requires a multi-week
> proposal/design-doc/review across 3+ different groups of committers, all of
> whom may have different near-term priorities. On the other hand, it would
> be pretty frustrating if the week before we're trying to cut a Kudu release
> branch, someone in another community decides to make a potentially
> destabilizing change to the RPC library.
>
>
> 3) Pre-commit/test mechanics
>
> Semi-related to the above: we currently feel pretty confident when we make
> a change to a central library like kudu/util/thread.cc that nothing broke
> because we run the full suite of Kudu tests. Of course the central
> libraries have some unit test coverage, but I wouldn't be confident with
> any sort of model where shared code can change without verification by a
> larger suite of tests.
>
> On the other hand, I also don't want to move to a model where any change to
> shared code requires a 6+-hour precommit spanning several projects, each of
> which may have its own set of potentially-flaky pre-commit tests, etc. I
> can imagine that if an Arrow developer made some change to "thread.cc" and
> saw that TabletServerStressTest failed their precommit, they'd have no idea
> how to triage it, etc. That could be a strong disincentive to continued
> innovation in these areas of common code, which we'll need a good way to
> avoid.
>
> I think some of the above could be ameliorated with really good
> infrastructure -- eg on a test failure, automatically re-run the failed
> test on both pre-patch and post-patch, do a t-test to check statistical
> significance in flakiness level, etc. But, that's a lot of infrastructure
> that doesn't currently exist.
>
>
> 4) Integration mechanics for breaking changes
>
> Currently these common libraries are treated as components of monolithic
> projects. That means it's no extra overhead for us to make some kind of
> change which breaks an API in src/kudu/util/ and at the same time updates
> all call sites. The internal libraries have no semblance of API
> compatibility guarantees, etc, and adding one is not without cost.
>
> Before sharing code, we should figure out how exactly we'll manage the
> cases where we want to make some change in a common library that breaks an
> API used by other projects, given there's no way to make an atomic commit
> across many repositories. One option is that each "user" of the libraries
> manually "rolls" to new versions when they feel like it, but there's still
> now a case where a common change "pushes work onto" the consumers to update
> call sites, etc.
>
> Admittedly, the number of breaking API changes in these common libraries is
> relatively small, but would still be good to understand how we would plan
> to manage them.
>
> -Todd
>
> On Sun, Feb 26, 2017 at 10:12 AM, Wes McKinney <we...@gmail.com>
> wrote:
>
> > hi Henry,
> >
> > Thank you for these comments.
> >
> > I think having a kind of "Apache Commons for [Modern] C++" would be an
> > ideal (though perhaps initially more labor intensive) solution.
> > There's code in Arrow that I would move into this project if it
> > existed. I am happy to help make this happen if there is interest from
> > the Kudu and Impala communities. I am not sure logistically what would
> > be the most expedient way to establish the project, whether as an ASF
> > Incubator project or possibly as a new TLP that could be created by
> > spinning IP out of Apache Kudu.
> >
> > I'm interested to hear the opinions of others, and possible next steps.
> >
> > Thanks
> > Wes
> >
> > On Sun, Feb 26, 2017 at 2:12 AM, Henry Robinson <he...@apache.org>
> wrote:
> > > Thanks for bringing this up, Wes.
> > >
> > > On 25 February 2017 at 14:18, Wes McKinney <we...@gmail.com>
> wrote:
> > >
> > >> Dear Apache Kudu and Apache Impala (incubating) communities,
> > >>
> > >> (I'm not sure the best way to have a cross-list discussion, so I
> > >> apologize if this does not work well)
> > >>
> > >> On the recent Apache Parquet sync call, we discussed C++ code sharing
> > >> between the codebases in Apache Arrow and Apache Parquet, and
> > >> opportunities for more code sharing with Kudu and Impala as well.
> > >>
> > >> As context
> > >>
> > >> * We have an RC out for the 1.0.0 release of apache-parquet-cpp, the
> > >> first C++ release within Apache Parquet. I got involved with this
> > >> project a little over a year ago and was faced with the unpleasant
> > >> decision to copy and paste a significant amount of code out of
> > >> Impala's codebase to bootstrap the project.
> > >>
> > >> * In parallel, we begin the Apache Arrow project, which is designed to
> > >> be a complementary library for file formats (like Parquet), storage
> > >> engines (like Kudu), and compute engines (like Impala and pandas).
> > >>
> > >> * As Arrow and parquet-cpp matured, an increasing amount of code
> > >> overlap crept up surrounding buffer memory management and IO
> > >> interface. We recently decided in PARQUET-818
> > >> (https://github.com/apache/parquet-cpp/commit/
> > >> 2154e873d5aa7280314189a2683fb1e12a590c02)
> > >> to remove some of the obvious code overlap in Parquet and make
> > >> libarrow.a/so a hard compile and link-time dependency for
> > >> libparquet.a/so.
> > >>
> > >> * There is still quite a bit of code in parquet-cpp that would better
> > >> fit in Arrow: SIMD hash utilities, RLE encoding, dictionary encoding,
> > >> compression, bit utilities, and so forth. Much of this code originated
> > >> from Impala
> > >>
> > >> This brings me to a next set of points:
> > >>
> > >> * parquet-cpp contains quite a bit of code that was extracted from
> > >> Impala. This is mostly self-contained in
> > >> https://github.com/apache/parquet-cpp/tree/master/src/parquet/util
> > >>
> > >> * My understanding is that Kudu extracted certain computational
> > >> utilities from Impala in its early days, but these tools have likely
> > >> diverged as the needs of the projects have evolved.
> > >>
> > >> Since all of these projects are quite different in their end goals
> > >> (runtime systems vs. libraries), touching code that is tightly coupled
> > >> to either Kudu or Impala's runtimes is probably not worth discussing.
> > >> However, I think there is a strong basis for collaboration on
> > >> computational utilities and vectorized array processing. Some obvious
> > >> areas that come to mind:
> > >>
> > >> * SIMD utilities (for hashing or processing of preallocated contiguous
> > >> memory)
> > >> * Array encoding utilities: RLE / Dictionary, etc.
> > >> * Bit manipulation (packing and unpacking, e.g. Daniel Lemire
> > >> contributed a patch to parquet-cpp around this)
> > >> * Date and time utilities
> > >> * Compression utilities
> > >>
> > >
> > > Between Kudu and Impala (at least) there are many more opportunities
> for
> > > sharing. Threads, logging, metrics, concurrent primitives - the list is
> > > quite long.
> > >
> > >
> > >>
> > >> I hope the benefits are obvious: consolidating efforts on unit
> > >> testing, benchmarking, performance optimizations, continuous
> > >> integration, and platform compatibility.
> > >>
> > >> Logistically speaking, one possible avenue might be to use Apache
> > >> Arrow as the place to assemble this code. Its thirdparty toolchain is
> > >> small, and it builds and installs fast. It is intended as a library to
> > >> have its headers used and linked against other applications. (As an
> > >> aside, I'm very interested in building optional support for Arrow
> > >> columnar messages into the kudu client).
> > >>
> > >
> > > In principle I'm in favour of code sharing, and it seems very much in
> > > keeping with the Apache way. However, practically speaking I'm of the
> > > opinion that it only makes sense to house shared support code in a
> > > separate, dedicated project.
> > >
> > > Embedding the shared libraries in, e.g., Arrow naturally limits the
> scope
> > > of sharing to utilities that Arrow is interested in. It would make no
> > sense
> > > to add a threading library to Arrow if it was never used natively.
> > Muddying
> > > the waters of the project's charter seems likely to lead to user, and
> > > developer, confusion. Similarly, we should not necessarily couple
> Arrow's
> > > design goals to those it inherits from Kudu and Impala's source code.
> > >
> > > I think I'd rather see a new Apache project than re-use a current one
> for
> > > two independent purposes.
> > >
> > >
> > >>
> > >> The downside of code sharing, which may have prevented it so far, are
> > >> the logistics of coordinating ASF release cycles and keeping build
> > >> toolchains in sync. It's taken us the past year to stabilize the
> > >> design of Arrow for its intended use cases, so at this point if we
> > >> went down this road I would be OK with helping the community commit to
> > >> a regular release cadence that would be faster than Impala, Kudu, and
> > >> Parquet's respective release cadences. Since members of the Kudu and
> > >> Impala PMC are also on the Arrow PMC, I trust we would be able to
> > >> collaborate to each other's mutual benefit and success.
> > >>
> > >> Note that Arrow does not throw C++ exceptions and similarly follows
> > >> Google C++ style guide to the same extent at Kudu and Impala.
> > >>
> > >> If this is something that either the Kudu or Impala communities would
> > >> like to pursue in earnest, I would be happy to work with you on next
> > >> steps. I would suggest that we start with something small so that we
> > >> could address the necessary build toolchain changes, and develop a
> > >> workflow for moving around code and tests, a protocol for code reviews
> > >> (e.g. Gerrit), and coordinating ASF releases.
> > >>
> > >
> > > I think, if I'm reading this correctly, that you're assuming
> integration
> > > with the 'downstream' projects (e.g. Impala and Kudu) would be done via
> > > their toolchains. For something as fast moving as utility code - and
> > > critical, where you want the latency between adding a fix and including
> > it
> > > in your build to be ~0 - that's a non-starter to me, at least with how
> > the
> > > toolchains are currently realised.
> > >
> > > I'd rather have the source code directly imported into Impala's tree -
> > > whether by git submodule or other mechanism. That way the coupling is
> > > looser, and we can move more quickly. I think that's important to other
> > > projects as well.
> > >
> > > Henry
> > >
> > >
> > >
> > >>
> > >> Let me know what you think.
> > >>
> > >> best
> > >> Wes
> > >>
> >
>
>
>
> --
> Todd Lipcon
> Software Engineer, Cloudera
>
-- 
-- 
Cheers,
Leif

Re: [DISCUSS] C++ code sharing amongst Apache {Arrow, Kudu, Impala, Parquet}

Posted by Leif Walsh <le...@gmail.com>.
I also support the idea of creating an "apache commons modern c++" style
library, maybe tailored toward the needs of columnar data processing
tools.  I think APR is the wrong project but I think that *style* of
project is the right direction to aim.

I agree this adds test and release process complexity across products but I
think the benefits of a shared, well-tested library outweigh that, and
creating such test infrastructure will have long-term benefits as well.

I'd be happy to lend a hand wherever it's needed.

On Sun, Feb 26, 2017 at 4:03 PM Todd Lipcon <to...@cloudera.com> wrote:

> Hey folks,
>
> As Henry mentioned, Impala is starting to share more code with Kudu (most
> notably our RPC system, but that pulls in a fair bit of utility code as
> well), so we've been chatting periodically offline about the best way to do
> this. Having more projects potentially interested in collaborating is
> definitely welcome, though I think does also increase the complexity of
> whatever solution we come up with.
>
> I think the potential benefits of collaboration are fairly self-evident, so
> I'll focus on my concerns here, which somewhat echo Henry's.
>
> 1) Open source release model
>
> The ASF is very much against having projects which do not do releases. So,
> if we were to create some new ASF project to hold this code, we'd be
> expected to do frequent releases thereof. Wes volunteered above to lead
> frequent releases, but we actually need at least 3 PMC members to vote on
> each release, and given people can come and go, we'd probably need at least
> 5-8 people who are actively committed to helping with the release process
> of this "commons" project.
>
> Unlike our existing projects, which seem to release every 2-3 months, if
> that, I think this one would have to release _much_ more frequently, if we
> expect downstream projects to depend on released versions rather than just
> pulling in some recent (or even trunk) git hash. Since the ASF requires the
> normal voting period and process for every release, I don't think we could
> do something like have "daily automatic releases", etc.
>
> We could probably campaign the ASF membership to treat this project
> differently, either as (a) a repository of code that never releases, in
> which case the "downstream" projects are responsible for vetting IP, etc,
> as part of their own release processes, or (b) a project which does
> automatic releases voted upon by robots. I'm guessing that (a) is more
> palatable from an IP perspective, and also from the perspective of the
> downstream projects.
>
>
> 2) Governance/review model
>
> The more projects there are sharing this common code, the more difficult it
> is to know whether a change would break something, or even whether a change
> is considered desirable for all of the projects. I don't want to get into
> some world where any change to a central library requires a multi-week
> proposal/design-doc/review across 3+ different groups of committers, all of
> whom may have different near-term priorities. On the other hand, it would
> be pretty frustrating if the week before we're trying to cut a Kudu release
> branch, someone in another community decides to make a potentially
> destabilizing change to the RPC library.
>
>
> 3) Pre-commit/test mechanics
>
> Semi-related to the above: we currently feel pretty confident when we make
> a change to a central library like kudu/util/thread.cc that nothing broke
> because we run the full suite of Kudu tests. Of course the central
> libraries have some unit test coverage, but I wouldn't be confident with
> any sort of model where shared code can change without verification by a
> larger suite of tests.
>
> On the other hand, I also don't want to move to a model where any change to
> shared code requires a 6+-hour precommit spanning several projects, each of
> which may have its own set of potentially-flaky pre-commit tests, etc. I
> can imagine that if an Arrow developer made some change to "thread.cc" and
> saw that TabletServerStressTest failed their precommit, they'd have no idea
> how to triage it, etc. That could be a strong disincentive to continued
> innovation in these areas of common code, which we'll need a good way to
> avoid.
>
> I think some of the above could be ameliorated with really good
> infrastructure -- eg on a test failure, automatically re-run the failed
> test on both pre-patch and post-patch, do a t-test to check statistical
> significance in flakiness level, etc. But, that's a lot of infrastructure
> that doesn't currently exist.
>
>
> 4) Integration mechanics for breaking changes
>
> Currently these common libraries are treated as components of monolithic
> projects. That means it's no extra overhead for us to make some kind of
> change which breaks an API in src/kudu/util/ and at the same time updates
> all call sites. The internal libraries have no semblance of API
> compatibility guarantees, etc, and adding one is not without cost.
>
> Before sharing code, we should figure out how exactly we'll manage the
> cases where we want to make some change in a common library that breaks an
> API used by other projects, given there's no way to make an atomic commit
> across many repositories. One option is that each "user" of the libraries
> manually "rolls" to new versions when they feel like it, but there's still
> now a case where a common change "pushes work onto" the consumers to update
> call sites, etc.
>
> Admittedly, the number of breaking API changes in these common libraries is
> relatively small, but would still be good to understand how we would plan
> to manage them.
>
> -Todd
>
> On Sun, Feb 26, 2017 at 10:12 AM, Wes McKinney <we...@gmail.com>
> wrote:
>
> > hi Henry,
> >
> > Thank you for these comments.
> >
> > I think having a kind of "Apache Commons for [Modern] C++" would be an
> > ideal (though perhaps initially more labor intensive) solution.
> > There's code in Arrow that I would move into this project if it
> > existed. I am happy to help make this happen if there is interest from
> > the Kudu and Impala communities. I am not sure logistically what would
> > be the most expedient way to establish the project, whether as an ASF
> > Incubator project or possibly as a new TLP that could be created by
> > spinning IP out of Apache Kudu.
> >
> > I'm interested to hear the opinions of others, and possible next steps.
> >
> > Thanks
> > Wes
> >
> > On Sun, Feb 26, 2017 at 2:12 AM, Henry Robinson <he...@apache.org>
> wrote:
> > > Thanks for bringing this up, Wes.
> > >
> > > On 25 February 2017 at 14:18, Wes McKinney <we...@gmail.com>
> wrote:
> > >
> > >> Dear Apache Kudu and Apache Impala (incubating) communities,
> > >>
> > >> (I'm not sure the best way to have a cross-list discussion, so I
> > >> apologize if this does not work well)
> > >>
> > >> On the recent Apache Parquet sync call, we discussed C++ code sharing
> > >> between the codebases in Apache Arrow and Apache Parquet, and
> > >> opportunities for more code sharing with Kudu and Impala as well.
> > >>
> > >> As context
> > >>
> > >> * We have an RC out for the 1.0.0 release of apache-parquet-cpp, the
> > >> first C++ release within Apache Parquet. I got involved with this
> > >> project a little over a year ago and was faced with the unpleasant
> > >> decision to copy and paste a significant amount of code out of
> > >> Impala's codebase to bootstrap the project.
> > >>
> > >> * In parallel, we began the Apache Arrow project, which is designed to
> > >> be a complementary library for file formats (like Parquet), storage
> > >> engines (like Kudu), and compute engines (like Impala and pandas).
> > >>
> > >> * As Arrow and parquet-cpp matured, an increasing amount of code
> > >> overlap crept in around buffer memory management and IO
> > >> interfaces. We recently decided in PARQUET-818
> > >> (https://github.com/apache/parquet-cpp/commit/
> > >> 2154e873d5aa7280314189a2683fb1e12a590c02)
> > >> to remove some of the obvious code overlap in Parquet and make
> > >> libarrow.a/so a hard compile and link-time dependency for
> > >> libparquet.a/so.
> > >>
> > >> * There is still quite a bit of code in parquet-cpp that would better
> > >> fit in Arrow: SIMD hash utilities, RLE encoding, dictionary encoding,
> > >> compression, bit utilities, and so forth. Much of this code originated
> > >> from Impala
> > >>
> > >> This brings me to the next set of points:
> > >>
> > >> * parquet-cpp contains quite a bit of code that was extracted from
> > >> Impala. This is mostly self-contained in
> > >> https://github.com/apache/parquet-cpp/tree/master/src/parquet/util
> > >>
> > >> * My understanding is that Kudu extracted certain computational
> > >> utilities from Impala in its early days, but these tools have likely
> > >> diverged as the needs of the projects have evolved.
> > >>
> > >> Since all of these projects are quite different in their end goals
> > >> (runtime systems vs. libraries), touching code that is tightly coupled
> > >> to either Kudu or Impala's runtimes is probably not worth discussing.
> > >> However, I think there is a strong basis for collaboration on
> > >> computational utilities and vectorized array processing. Some obvious
> > >> areas that come to mind:
> > >>
> > >> * SIMD utilities (for hashing or processing of preallocated contiguous
> > >> memory)
> > >> * Array encoding utilities: RLE / Dictionary, etc.
> > >> * Bit manipulation (packing and unpacking, e.g. Daniel Lemire
> > >> contributed a patch to parquet-cpp around this)
> > >> * Date and time utilities
> > >> * Compression utilities
> > >>
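[Editor's note: to make the flavor of one of the utilities listed above concrete, run-length encoding amounts to collapsing repeated values into (value, count) pairs. The sketch below is purely illustrative, in Python rather than the C++ these projects would actually share; the real implementations in parquet-cpp and Impala are considerably more involved.]

```python
def rle_encode(values):
    """Collapse consecutive repeated values into (value, run_length) pairs."""
    runs = []
    for v in values:
        if runs and runs[-1][0] == v:
            runs[-1][1] += 1  # extend the current run
        else:
            runs.append([v, 1])  # start a new run
    return [(v, n) for v, n in runs]


def rle_decode(runs):
    """Expand (value, run_length) pairs back into the original sequence."""
    out = []
    for v, n in runs:
        out.extend([v] * n)
    return out
```

For example, `rle_encode([7, 7, 7, 1])` returns `[(7, 3), (1, 1)]`, and decoding round-trips the input.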
> > >
> > > Between Kudu and Impala (at least) there are many more opportunities
> for
> > > sharing. Threads, logging, metrics, concurrent primitives - the list is
> > > quite long.
> > >
> > >
> > >>
> > >> I hope the benefits are obvious: consolidating efforts on unit
> > >> testing, benchmarking, performance optimizations, continuous
> > >> integration, and platform compatibility.
> > >>
> > >> Logistically speaking, one possible avenue might be to use Apache
> > >> Arrow as the place to assemble this code. Its third-party toolchain is
> > >> small, and it builds and installs fast. It is intended as a library to
> > >> have its headers used and linked against other applications. (As an
> > >> aside, I'm very interested in building optional support for Arrow
> > >> columnar messages into the kudu client).
> > >>
> > >
> > > In principle I'm in favour of code sharing, and it seems very much in
> > > keeping with the Apache way. However, practically speaking I'm of the
> > > opinion that it only makes sense to house shared support code in a
> > > separate, dedicated project.
> > >
> > > Embedding the shared libraries in, e.g., Arrow naturally limits the
> scope
> > > of sharing to utilities that Arrow is interested in. It would make no
> > sense
> > > to add a threading library to Arrow if it was never used natively.
> > Muddying
> > > the waters of the project's charter seems likely to lead to user, and
> > > developer, confusion. Similarly, we should not necessarily couple
> Arrow's
> > > design goals to those it inherits from Kudu and Impala's source code.
> > >
> > > I think I'd rather see a new Apache project than re-use a current one
> for
> > > two independent purposes.
> > >
> > >
> > >>
> > >> The downsides of code sharing, which may have prevented it so far, are
> > >> the logistics of coordinating ASF release cycles and keeping build
> > >> toolchains in sync. It's taken us the past year to stabilize the
> > >> design of Arrow for its intended use cases, so at this point if we
> > >> went down this road I would be OK with helping the community commit to
> > >> a regular release cadence that would be faster than Impala, Kudu, and
> > >> Parquet's respective release cadences. Since members of the Kudu and
> > >> Impala PMC are also on the Arrow PMC, I trust we would be able to
> > >> collaborate to each other's mutual benefit and success.
> > >>
> > >> Note that Arrow does not throw C++ exceptions and similarly follows
> > >> the Google C++ style guide to the same extent as Kudu and Impala.
> > >>
> > >> If this is something that either the Kudu or Impala communities would
> > >> like to pursue in earnest, I would be happy to work with you on next
> > >> steps. I would suggest that we start with something small so that we
> > >> could address the necessary build toolchain changes, and develop a
> > >> workflow for moving around code and tests, a protocol for code reviews
> > >> (e.g. Gerrit), and coordinating ASF releases.
> > >>
> > >
> > > I think, if I'm reading this correctly, that you're assuming
> integration
> > > with the 'downstream' projects (e.g. Impala and Kudu) would be done via
> > > their toolchains. For something as fast moving as utility code - and
> > > critical, where you want the latency between adding a fix and including
> > it
> > > in your build to be ~0 - that's a non-starter to me, at least with how
> > the
> > > toolchains are currently realised.
> > >
> > > I'd rather have the source code directly imported into Impala's tree -
> > > whether by git submodule or other mechanism. That way the coupling is
> > > looser, and we can move more quickly. I think that's important to other
> > > projects as well.
> > >
> > > Henry
> > >
> > >
> > >
> > >>
> > >> Let me know what you think.
> > >>
> > >> best
> > >> Wes
> > >>
> >
>
>
>
> --
> Todd Lipcon
> Software Engineer, Cloudera
>
-- 
Cheers,
Leif

Re: [DISCUSS] C++ code sharing amongst Apache {Arrow, Kudu, Impala, Parquet}

Posted by Leif Walsh <le...@gmail.com>.
I also support the idea of creating an "apache commons modern c++" style
library, maybe tailored toward the needs of columnar data processing
tools.  I think APR is the wrong project but I think that *style* of
project is the right direction to aim.

I agree this adds test and release process complexity across products but I
think the benefits of a shared, well-tested library outweigh that, and
creating such test infrastructure will have long-term benefits as well.

I'd be happy to lend a hand wherever it's needed.

-- 
Cheers,
Leif

Re: [DISCUSS] C++ code sharing amongst Apache {Arrow, Kudu, Impala, Parquet}

Posted by Todd Lipcon <to...@cloudera.com>.
Hey folks,

As Henry mentioned, Impala is starting to share more code with Kudu (most
notably our RPC system, but that pulls in a fair bit of utility code as
well), so we've been chatting periodically offline about the best way to do
this. Having more projects potentially interested in collaborating is
definitely welcome, though I think it also increases the complexity of
whatever solution we come up with.

I think the potential benefits of collaboration are fairly self-evident, so
I'll focus on my concerns here, which somewhat echo Henry's.

1) Open source release model

The ASF is very much against having projects which do not do releases. So,
if we were to create some new ASF project to hold this code, we'd be
expected to do frequent releases thereof. Wes volunteered above to lead
frequent releases, but we actually need at least 3 PMC members to vote on
each release, and given people can come and go, we'd probably need at least
5-8 people who are actively committed to helping with the release process
of this "commons" project.

Unlike our existing projects, which seem to release every 2-3 months, if
that, I think this one would have to release _much_ more frequently, if we
expect downstream projects to depend on released versions rather than just
pulling in some recent (or even trunk) git hash. Since the ASF requires the
normal voting period and process for every release, I don't think we could
do something like have "daily automatic releases", etc.

We could probably campaign the ASF membership to treat this project
differently, either as (a) a repository of code that never releases, in
which case the "downstream" projects are responsible for vetting IP, etc,
as part of their own release processes, or (b) a project which does
automatic releases voted upon by robots. I'm guessing that (a) is more
palatable from an IP perspective, and also from the perspective of the
downstream projects.


2) Governance/review model

The more projects there are sharing this common code, the more difficult it
is to know whether a change would break something, or even whether a change
is considered desirable for all of the projects. I don't want to get into
some world where any change to a central library requires a multi-week
proposal/design-doc/review across 3+ different groups of committers, all of
whom may have different near-term priorities. On the other hand, it would
be pretty frustrating if the week before we're trying to cut a Kudu release
branch, someone in another community decides to make a potentially
destabilizing change to the RPC library.


3) Pre-commit/test mechanics

Semi-related to the above: we currently feel pretty confident when we make
a change to a central library like kudu/util/thread.cc that nothing broke
because we run the full suite of Kudu tests. Of course the central
libraries have some unit test coverage, but I wouldn't be confident with
any sort of model where shared code can change without verification by a
larger suite of tests.

On the other hand, I also don't want to move to a model where any change to
shared code requires a 6+-hour precommit spanning several projects, each of
which may have its own set of potentially-flaky pre-commit tests, etc. I
can imagine that if an Arrow developer made some change to "thread.cc" and
saw that TabletServerStressTest failed their precommit, they'd have no idea
how to triage it, etc. That could be a strong disincentive to continued
innovation in these areas of common code, so we'll need a good way to
avoid that.

I think some of the above could be ameliorated with really good
infrastructure -- e.g., on a test failure, automatically re-run the failed
test on both pre-patch and post-patch, do a t-test to check statistical
significance in flakiness level, etc. But, that's a lot of infrastructure
that doesn't currently exist.
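[Editor's note: the re-run-and-compare idea above can be sketched concretely. The snippet below is a hypothetical illustration, not existing infrastructure; `pre` and `post` are invented names for lists of pass(0)/fail(1) outcomes from repeated runs of the flaky test before and after the patch.]

```python
import math


def welch_t(pre, post):
    """Welch's t statistic comparing the mean failure rates of two
    samples of pass(0)/fail(1) outcomes from repeated test runs."""
    n1, n2 = len(pre), len(post)
    m1, m2 = sum(pre) / n1, sum(post) / n2
    # Unbiased sample variances (ddof=1).
    v1 = sum((x - m1) ** 2 for x in pre) / (n1 - 1)
    v2 = sum((x - m2) ** 2 for x in post) / (n2 - 1)
    denom = math.sqrt(v1 / n1 + v2 / n2)
    if denom == 0.0:
        # Both samples are constant: either identical (no signal)
        # or deterministically different.
        return 0.0 if m1 == m2 else math.copysign(math.inf, m2 - m1)
    return (m2 - m1) / denom


def flakiness_regressed(pre, post, threshold=2.0):
    """Flag a patch when post-patch failures are more frequent than
    pre-patch by more than a rough significance cutoff."""
    return welch_t(pre, post) > threshold
```

For instance, 20 pre-patch passes against a post-patch sample that fails half the time yields a large positive t, flagging the patch, while identical pre/post behavior yields t = 0.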


4) Integration mechanics for breaking changes

Currently these common libraries are treated as components of monolithic
projects. That means it's no extra overhead for us to make some kind of
change which breaks an API in src/kudu/util/ and at the same time updates
all call sites. The internal libraries have no semblance of API
compatibility guarantees, etc, and adding one is not without cost.

Before sharing code, we should figure out how exactly we'll manage the
cases where we want to make some change in a common library that breaks an
API used by other projects, given there's no way to make an atomic commit
across many repositories. One option is that each "user" of the libraries
manually "rolls" to new versions when they feel like it, but there's still
now a case where a common change "pushes work onto" the consumers to update
call sites, etc.

Admittedly, the number of breaking API changes in these common libraries is
relatively small, but it would still be good to understand how we would plan
to manage them.

-Todd

On Sun, Feb 26, 2017 at 10:12 AM, Wes McKinney <we...@gmail.com> wrote:

> hi Henry,
>
> Thank you for these comments.
>
> I think having a kind of "Apache Commons for [Modern] C++" would be an
> ideal (though perhaps initially more labor intensive) solution.
> There's code in Arrow that I would move into this project if it
> existed. I am happy to help make this happen if there is interest from
> the Kudu and Impala communities. I am not sure logistically what would
> be the most expedient way to establish the project, whether as an ASF
> Incubator project or possibly as a new TLP that could be created by
> spinning IP out of Apache Kudu.
>
> I'm interested to hear the opinions of others, and possible next steps.
>
> Thanks
> Wes
>
> On Sun, Feb 26, 2017 at 2:12 AM, Henry Robinson <he...@apache.org> wrote:
> > Thanks for bringing this up, Wes.
> >
> > On 25 February 2017 at 14:18, Wes McKinney <we...@gmail.com> wrote:
> >
> >> Dear Apache Kudu and Apache Impala (incubating) communities,
> >>
> >> (I'm not sure the best way to have a cross-list discussion, so I
> >> apologize if this does not work well)
> >>
> >> On the recent Apache Parquet sync call, we discussed C++ code sharing
> >> between the codebases in Apache Arrow and Apache Parquet, and
> >> opportunities for more code sharing with Kudu and Impala as well.
> >>
> >> As context
> >>
> >> * We have an RC out for the 1.0.0 release of apache-parquet-cpp, the
> >> first C++ release within Apache Parquet. I got involved with this
> >> project a little over a year ago and was faced with the unpleasant
> >> decision to copy and paste a significant amount of code out of
> >> Impala's codebase to bootstrap the project.
> >>
> >> * In parallel, we begin the Apache Arrow project, which is designed to
> >> be a complementary library for file formats (like Parquet), storage
> >> engines (like Kudu), and compute engines (like Impala and pandas).
> >>
> >> * As Arrow and parquet-cpp matured, an increasing amount of code
> >> overlap crept up surrounding buffer memory management and IO
> >> interface. We recently decided in PARQUET-818
> >> (https://github.com/apache/parquet-cpp/commit/
> >> 2154e873d5aa7280314189a2683fb1e12a590c02)
> >> to remove some of the obvious code overlap in Parquet and make
> >> libarrow.a/so a hard compile and link-time dependency for
> >> libparquet.a/so.
> >>
> >> * There is still quite a bit of code in parquet-cpp that would better
> >> fit in Arrow: SIMD hash utilities, RLE encoding, dictionary encoding,
> >> compression, bit utilities, and so forth. Much of this code originated
> >> from Impala
> >>
> >> This brings me to a next set of points:
> >>
> >> * parquet-cpp contains quite a bit of code that was extracted from
> >> Impala. This is mostly self-contained in
> >> https://github.com/apache/parquet-cpp/tree/master/src/parquet/util
> >>
> >> * My understanding is that Kudu extracted certain computational
> >> utilities from Impala in its early days, but these tools have likely
> >> diverged as the needs of the projects have evolved.
> >>
> >> Since all of these projects are quite different in their end goals
> >> (runtime systems vs. libraries), touching code that is tightly coupled
> >> to either Kudu or Impala's runtimes is probably not worth discussing.
> >> However, I think there is a strong basis for collaboration on
> >> computational utilities and vectorized array processing. Some obvious
> >> areas that come to mind:
> >>
> >> * SIMD utilities (for hashing or processing of preallocated contiguous
> >> memory)
> >> * Array encoding utilities: RLE / Dictionary, etc.
> >> * Bit manipulation (packing and unpacking, e.g. Daniel Lemire
> >> contributed a patch to parquet-cpp around this)
> >> * Date and time utilities
> >> * Compression utilities
> >>
> >
> > Between Kudu and Impala (at least) there are many more opportunities for
> > sharing. Threads, logging, metrics, concurrent primitives - the list is
> > quite long.
> >
> >
> >>
> >> I hope the benefits are obvious: consolidating efforts on unit
> >> testing, benchmarking, performance optimizations, continuous
> >> integration, and platform compatibility.
> >>
> >> Logistically speaking, one possible avenue might be to use Apache
> >> Arrow as the place to assemble this code. Its thirdparty toolchain is
> >> small, and it builds and installs fast. It is intended as a library to
> >> have its headers used and linked against other applications. (As an
> >> aside, I'm very interested in building optional support for Arrow
> >> columnar messages into the kudu client).
> >>
> >
> > In principle I'm in favour of code sharing, and it seems very much in
> > keeping with the Apache way. However, practically speaking I'm of the
> > opinion that it only makes sense to house shared support code in a
> > separate, dedicated project.
> >
> > Embedding the shared libraries in, e.g., Arrow naturally limits the scope
> > of sharing to utilities that Arrow is interested in. It would make no
> sense
> > to add a threading library to Arrow if it was never used natively.
> Muddying
> > the waters of the project's charter seems likely to lead to user, and
> > developer, confusion. Similarly, we should not necessarily couple Arrow's
> > design goals to those it inherits from Kudu and Impala's source code.
> >
> > I think I'd rather see a new Apache project than re-use a current one for
> > two independent purposes.
> >
> >
> >>
> >> The downside of code sharing, which may have prevented it so far, are
> >> the logistics of coordinating ASF release cycles and keeping build
> >> toolchains in sync. It's taken us the past year to stabilize the
> >> design of Arrow for its intended use cases, so at this point if we
> >> went down this road I would be OK with helping the community commit to
> >> a regular release cadence that would be faster than Impala, Kudu, and
> >> Parquet's respective release cadences. Since members of the Kudu and
> >> Impala PMC are also on the Arrow PMC, I trust we would be able to
> >> collaborate to each other's mutual benefit and success.
> >>
> >> Note that Arrow does not throw C++ exceptions and similarly follows
> >> the Google C++ style guide to the same extent as Kudu and Impala.
> >>
> >> If this is something that either the Kudu or Impala communities would
> >> like to pursue in earnest, I would be happy to work with you on next
> >> steps. I would suggest that we start with something small so that we
> >> could address the necessary build toolchain changes, and develop a
> >> workflow for moving around code and tests, a protocol for code reviews
> >> (e.g. Gerrit), and coordinating ASF releases.
> >>
> >
> > I think, if I'm reading this correctly, that you're assuming integration
> > with the 'downstream' projects (e.g. Impala and Kudu) would be done via
> > their toolchains. For something as fast moving as utility code - and
> > critical, where you want the latency between adding a fix and including
> it
> > in your build to be ~0 - that's a non-starter to me, at least with how
> the
> > toolchains are currently realised.
> >
> > I'd rather have the source code directly imported into Impala's tree -
> > whether by git submodule or other mechanism. That way the coupling is
> > looser, and we can move more quickly. I think that's important to other
> > projects as well.
> >
> > Henry
> >
> >
> >
> >>
> >> Let me know what you think.
> >>
> >> best
> >> Wes
> >>
>



-- 
Todd Lipcon
Software Engineer, Cloudera
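
The encoding and bit-manipulation utilities listed in the quoted message are the kind of small, self-contained code under discussion. As a purely illustrative sketch (not the actual Impala or parquet-cpp implementation), a minimal run-length encoder over bytes looks like:

```cpp
#include <cstdint>
#include <utility>
#include <vector>

// Encode a byte sequence as (run length, value) pairs.
std::vector<std::pair<uint32_t, uint8_t>> RleEncode(const std::vector<uint8_t>& in) {
  std::vector<std::pair<uint32_t, uint8_t>> runs;
  for (uint8_t b : in) {
    if (!runs.empty() && runs.back().second == b) {
      ++runs.back().first;  // extend the current run
    } else {
      runs.push_back({1, b});  // start a new run
    }
  }
  return runs;
}

// Decode back to the original byte sequence.
std::vector<uint8_t> RleDecode(const std::vector<std::pair<uint32_t, uint8_t>>& runs) {
  std::vector<uint8_t> out;
  for (const auto& r : runs) {
    out.insert(out.end(), r.first, r.second);
  }
  return out;
}
```

The production versions in parquet-cpp interleave RLE runs with bit-packed literal runs, which is where the shared bit-packing utilities come in.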

Re: [DISCUSS] C++ code sharing amongst Apache {Arrow, Kudu, Impala, Parquet}

Posted by Miki Tebeka <mi...@gmail.com>.
Can't some (most) of it be added to APR <https://apr.apache.org/>?

On Sun, Feb 26, 2017 at 8:12 PM, Wes McKinney <we...@gmail.com> wrote:

> hi Henry,
>
> Thank you for these comments.
>
> I think having a kind of "Apache Commons for [Modern] C++" would be an
> ideal (though perhaps initially more labor intensive) solution.
> There's code in Arrow that I would move into this project if it
> existed. I am happy to help make this happen if there is interest from
> the Kudu and Impala communities. I am not sure logistically what would
> be the most expedient way to establish the project, whether as an ASF
> Incubator project or possibly as a new TLP that could be created by
> spinning IP out of Apache Kudu.
>
> I'm interested to hear the opinions of others, and possible next steps.
>
> Thanks
> Wes
>
> On Sun, Feb 26, 2017 at 2:12 AM, Henry Robinson <he...@apache.org> wrote:
> > Thanks for bringing this up, Wes.
> >
> > On 25 February 2017 at 14:18, Wes McKinney <we...@gmail.com> wrote:
> >
> >> Dear Apache Kudu and Apache Impala (incubating) communities,
> >>
> >> (I'm not sure the best way to have a cross-list discussion, so I
> >> apologize if this does not work well)
> >>
> >> On the recent Apache Parquet sync call, we discussed C++ code sharing
> >> between the codebases in Apache Arrow and Apache Parquet, and
> >> opportunities for more code sharing with Kudu and Impala as well.
> >>
> >> As context
> >>
> >> * We have an RC out for the 1.0.0 release of apache-parquet-cpp, the
> >> first C++ release within Apache Parquet. I got involved with this
> >> project a little over a year ago and was faced with the unpleasant
> >> decision to copy and paste a significant amount of code out of
> >> Impala's codebase to bootstrap the project.
> >>
> >> * In parallel, we began the Apache Arrow project, which is designed to
> >> be a complementary library for file formats (like Parquet), storage
> >> engines (like Kudu), and compute engines (like Impala and pandas).
> >>
> >> * As Arrow and parquet-cpp matured, an increasing amount of code
> >> overlap crept up surrounding buffer memory management and IO
> >> interface. We recently decided in PARQUET-818
> >> (https://github.com/apache/parquet-cpp/commit/
> >> 2154e873d5aa7280314189a2683fb1e12a590c02)
> >> to remove some of the obvious code overlap in Parquet and make
> >> libarrow.a/so a hard compile and link-time dependency for
> >> libparquet.a/so.
> >>
> >> * There is still quite a bit of code in parquet-cpp that would better
> >> fit in Arrow: SIMD hash utilities, RLE encoding, dictionary encoding,
> >> compression, bit utilities, and so forth. Much of this code originated
> >> from Impala
> >>
> >> This brings me to a next set of points:
> >>
> >> * parquet-cpp contains quite a bit of code that was extracted from
> >> Impala. This is mostly self-contained in
> >> https://github.com/apache/parquet-cpp/tree/master/src/parquet/util
> >>
> >> * My understanding is that Kudu extracted certain computational
> >> utilities from Impala in its early days, but these tools have likely
> >> diverged as the needs of the projects have evolved.
> >>
> >> Since all of these projects are quite different in their end goals
> >> (runtime systems vs. libraries), touching code that is tightly coupled
> >> to either Kudu or Impala's runtimes is probably not worth discussing.
> >> However, I think there is a strong basis for collaboration on
> >> computational utilities and vectorized array processing. Some obvious
> >> areas that come to mind:
> >>
> >> * SIMD utilities (for hashing or processing of preallocated contiguous
> >> memory)
> >> * Array encoding utilities: RLE / Dictionary, etc.
> >> * Bit manipulation (packing and unpacking, e.g. Daniel Lemire
> >> contributed a patch to parquet-cpp around this)
> >> * Date and time utilities
> >> * Compression utilities
> >>
> >
> > Between Kudu and Impala (at least) there are many more opportunities for
> > sharing. Threads, logging, metrics, concurrent primitives - the list is
> > quite long.
> >
> >
> >>
> >> I hope the benefits are obvious: consolidating efforts on unit
> >> testing, benchmarking, performance optimizations, continuous
> >> integration, and platform compatibility.
> >>
> >> Logistically speaking, one possible avenue might be to use Apache
> >> Arrow as the place to assemble this code. Its thirdparty toolchain is
> >> small, and it builds and installs fast. It is intended as a library to
> >> have its headers used and linked against other applications. (As an
> >> aside, I'm very interested in building optional support for Arrow
> >> columnar messages into the kudu client).
> >>
> >
> > In principle I'm in favour of code sharing, and it seems very much in
> > keeping with the Apache way. However, practically speaking I'm of the
> > opinion that it only makes sense to house shared support code in a
> > separate, dedicated project.
> >
> > Embedding the shared libraries in, e.g., Arrow naturally limits the scope
> > of sharing to utilities that Arrow is interested in. It would make no
> sense
> > to add a threading library to Arrow if it was never used natively.
> Muddying
> > the waters of the project's charter seems likely to lead to user, and
> > developer, confusion. Similarly, we should not necessarily couple Arrow's
> > design goals to those it inherits from Kudu and Impala's source code.
> >
> > I think I'd rather see a new Apache project than re-use a current one for
> > two independent purposes.
> >
> >
> >>
> >> The downside of code sharing, which may have prevented it so far, is
> >> the logistics of coordinating ASF release cycles and keeping build
> >> toolchains in sync. It's taken us the past year to stabilize the
> >> design of Arrow for its intended use cases, so at this point if we
> >> went down this road I would be OK with helping the community commit to
> >> a regular release cadence that would be faster than Impala, Kudu, and
> >> Parquet's respective release cadences. Since members of the Kudu and
> >> Impala PMC are also on the Arrow PMC, I trust we would be able to
> >> collaborate to each other's mutual benefit and success.
> >>
> >> Note that Arrow does not throw C++ exceptions and similarly follows
> >> the Google C++ style guide to the same extent as Kudu and Impala.
> >>
> >> If this is something that either the Kudu or Impala communities would
> >> like to pursue in earnest, I would be happy to work with you on next
> >> steps. I would suggest that we start with something small so that we
> >> could address the necessary build toolchain changes, and develop a
> >> workflow for moving around code and tests, a protocol for code reviews
> >> (e.g. Gerrit), and coordinating ASF releases.
> >>
> >
> > I think, if I'm reading this correctly, that you're assuming integration
> > with the 'downstream' projects (e.g. Impala and Kudu) would be done via
> > their toolchains. For something as fast moving as utility code - and
> > critical, where you want the latency between adding a fix and including
> it
> > in your build to be ~0 - that's a non-starter to me, at least with how
> the
> > toolchains are currently realised.
> >
> > I'd rather have the source code directly imported into Impala's tree -
> > whether by git submodule or other mechanism. That way the coupling is
> > looser, and we can move more quickly. I think that's important to other
> > projects as well.
> >
> > Henry
> >
> >
> >
> >>
> >> Let me know what you think.
> >>
> >> best
> >> Wes
> >>
>
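
The no-exceptions convention mentioned above usually means reporting failure through a returned Status object rather than throwing. A minimal sketch of that pattern (illustrative only; not the real arrow::Status API):

```cpp
#include <cstdlib>
#include <string>
#include <utility>

// Minimal Status-style error type in the spirit of the shared
// Google-style convention (illustrative; not the real arrow::Status).
class Status {
 public:
  static Status OK() { return Status(""); }
  static Status Invalid(std::string msg) { return Status(std::move(msg)); }
  bool ok() const { return msg_.empty(); }
  const std::string& message() const { return msg_; }
 private:
  explicit Status(std::string msg) : msg_(std::move(msg)) {}
  std::string msg_;
};

// Functions report failure via Status instead of throwing exceptions.
Status ParsePositive(const std::string& s, int* out) {
  char* end = nullptr;
  long v = std::strtol(s.c_str(), &end, 10);
  if (end == s.c_str() || *end != '\0') return Status::Invalid("not an integer: " + s);
  if (v <= 0) return Status::Invalid("value must be positive: " + s);
  *out = static_cast<int>(v);
  return Status::OK();
}
```

Sharing one such error-handling convention is part of what makes code movement between these codebases practical.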

Re: [DISCUSS] C++ code sharing amongst Apache {Arrow, Kudu, Impala, Parquet}

Posted by Todd Lipcon <to...@cloudera.com>.
Hey folks,

As Henry mentioned, Impala is starting to share more code with Kudu (most
notably our RPC system, but that pulls in a fair bit of utility code as
well), so we've been chatting periodically offline about the best way to do
this. Having more projects potentially interested in collaborating is
definitely welcome, though I think does also increase the complexity of
whatever solution we come up with.

I think the potential benefits of collaboration are fairly self-evident, so
I'll focus on my concerns here, which somewhat echo Henry's.

1) Open source release model

The ASF is very much against having projects which do not do releases. So,
if we were to create some new ASF project to hold this code, we'd be
expected to do frequent releases thereof. Wes volunteered above to lead
frequent releases, but we actually need at least 3 PMC members to vote on
each release, and given people can come and go, we'd probably need at least
5-8 people who are actively committed to helping with the release process
of this "commons" project.

Unlike our existing projects, which seem to release every 2-3 months, if
that, I think this one would have to release _much_ more frequently, if we
expect downstream projects to depend on released versions rather than just
pulling in some recent (or even trunk) git hash. Since the ASF requires the
normal voting period and process for every release, I don't think we could
do something like have "daily automatic releases", etc.

We could probably campaign the ASF membership to treat this project
differently, either as (a) a repository of code that never releases, in
which case the "downstream" projects are responsible for vetting IP, etc,
as part of their own release processes, or (b) a project which does
automatic releases voted upon by robots. I'm guessing that (a) is more
palatable from an IP perspective, and also from the perspective of the
downstream projects.


2) Governance/review model

The more projects there are sharing this common code, the more difficult it
is to know whether a change would break something, or even whether a change
is considered desirable for all of the projects. I don't want to get into
some world where any change to a central library requires a multi-week
proposal/design-doc/review across 3+ different groups of committers, all of
whom may have different near-term priorities. On the other hand, it would
be pretty frustrating if the week before we're trying to cut a Kudu release
branch, someone in another community decides to make a potentially
destabilizing change to the RPC library.


3) Pre-commit/test mechanics

Semi-related to the above: we currently feel pretty confident when we make
a change to a central library like kudu/util/thread.cc that nothing broke
because we run the full suite of Kudu tests. Of course the central
libraries have some unit test coverage, but I wouldn't be confident with
any sort of model where shared code can change without verification by a
larger suite of tests.

On the other hand, I also don't want to move to a model where any change to
shared code requires a 6+-hour precommit spanning several projects, each of
which may have its own set of potentially-flaky pre-commit tests, etc. I
can imagine that if an Arrow developer made some change to "thread.cc" and
saw that TabletServerStressTest failed their precommit, they'd have no idea
how to triage it, etc. That could be a strong disincentive to continued
innovation in these areas of common code, which we'll need a good way to
avoid.

I think some of the above could be ameliorated with really good
infrastructure -- eg on a test failure, automatically re-run the failed
test on both pre-patch and post-patch, do a t-test to check statistical
significance in flakiness level, etc. But, that's a lot of infrastructure
that doesn't currently exist.
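
The statistical check sketched here could be as simple as a two-proportion z-test on the pre-patch and post-patch failure counts. An illustrative version (thresholds and framing are assumptions, not an existing tool):

```cpp
#include <cmath>

// Two-proportion z-statistic: do two observed failure rates differ by
// more than chance would explain? (Sketch of the kind of flakiness
// check described above; a real system would choose thresholds and
// sample sizes carefully.)
double FlakinessZStat(int fail_pre, int runs_pre, int fail_post, int runs_post) {
  double p_pre = static_cast<double>(fail_pre) / runs_pre;
  double p_post = static_cast<double>(fail_post) / runs_post;
  // Pooled proportion under the null hypothesis of equal flakiness.
  double p = static_cast<double>(fail_pre + fail_post) / (runs_pre + runs_post);
  double se = std::sqrt(p * (1.0 - p) * (1.0 / runs_pre + 1.0 / runs_post));
  return (p_post - p_pre) / se;
}

// z > 1.96 corresponds roughly to p < 0.05 (two-sided test).
bool LikelyMoreFlaky(int fail_pre, int runs_pre, int fail_post, int runs_post) {
  return FlakinessZStat(fail_pre, runs_pre, fail_post, runs_post) > 1.96;
}
```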


4) Integration mechanics for breaking changes

Currently these common libraries are treated as components of monolithic
projects. That means it's no extra overhead for us to make some kind of
change which breaks an API in src/kudu/util/ and at the same time updates
all call sites. The internal libraries have no semblance of API
compatibility guarantees, etc, and adding one is not without cost.

Before sharing code, we should figure out how exactly we'll manage the
cases where we want to make some change in a common library that breaks an
API used by other projects, given there's no way to make an atomic commit
across many repositories. One option is that each "user" of the libraries
manually "rolls" to new versions when they feel like it, but there's still
now a case where a common change "pushes work onto" the consumers to update
call sites, etc.

Admittedly, the number of breaking API changes in these common libraries is
relatively small, but would still be good to understand how we would plan
to manage them.

-Todd

On Sun, Feb 26, 2017 at 10:12 AM, Wes McKinney <we...@gmail.com> wrote:

> hi Henry,
>
> Thank you for these comments.
>
> I think having a kind of "Apache Commons for [Modern] C++" would be an
> ideal (though perhaps initially more labor intensive) solution.
> There's code in Arrow that I would move into this project if it
> existed. I am happy to help make this happen if there is interest from
> the Kudu and Impala communities. I am not sure logistically what would
> be the most expedient way to establish the project, whether as an ASF
> Incubator project or possibly as a new TLP that could be created by
> spinning IP out of Apache Kudu.
>
> I'm interested to hear the opinions of others, and possible next steps.
>
> Thanks
> Wes
>
> On Sun, Feb 26, 2017 at 2:12 AM, Henry Robinson <he...@apache.org> wrote:
> > Thanks for bringing this up, Wes.
> >
> > On 25 February 2017 at 14:18, Wes McKinney <we...@gmail.com> wrote:
> >
> >> Dear Apache Kudu and Apache Impala (incubating) communities,
> >>
> >> (I'm not sure the best way to have a cross-list discussion, so I
> >> apologize if this does not work well)
> >>
> >> On the recent Apache Parquet sync call, we discussed C++ code sharing
> >> between the codebases in Apache Arrow and Apache Parquet, and
> >> opportunities for more code sharing with Kudu and Impala as well.
> >>
> >> As context
> >>
> >> * We have an RC out for the 1.0.0 release of apache-parquet-cpp, the
> >> first C++ release within Apache Parquet. I got involved with this
> >> project a little over a year ago and was faced with the unpleasant
> >> decision to copy and paste a significant amount of code out of
> >> Impala's codebase to bootstrap the project.
> >>
> >> * In parallel, we began the Apache Arrow project, which is designed to
> >> be a complementary library for file formats (like Parquet), storage
> >> engines (like Kudu), and compute engines (like Impala and pandas).
> >>
> >> * As Arrow and parquet-cpp matured, an increasing amount of code
> >> overlap crept up surrounding buffer memory management and IO
> >> interface. We recently decided in PARQUET-818
> >> (https://github.com/apache/parquet-cpp/commit/
> >> 2154e873d5aa7280314189a2683fb1e12a590c02)
> >> to remove some of the obvious code overlap in Parquet and make
> >> libarrow.a/so a hard compile and link-time dependency for
> >> libparquet.a/so.
> >>
> >> * There is still quite a bit of code in parquet-cpp that would better
> >> fit in Arrow: SIMD hash utilities, RLE encoding, dictionary encoding,
> >> compression, bit utilities, and so forth. Much of this code originated
> >> from Impala.
> >>
> >> This brings me to the next set of points:
> >>
> >> * parquet-cpp contains quite a bit of code that was extracted from
> >> Impala. This is mostly self-contained in
> >> https://github.com/apache/parquet-cpp/tree/master/src/parquet/util
> >>
> >> * My understanding is that Kudu extracted certain computational
> >> utilities from Impala in its early days, but these tools have likely
> >> diverged as the needs of the projects have evolved.
> >>
> >> Since all of these projects are quite different in their end goals
> >> (runtime systems vs. libraries), touching code that is tightly coupled
> >> to either Kudu or Impala's runtimes is probably not worth discussing.
> >> However, I think there is a strong basis for collaboration on
> >> computational utilities and vectorized array processing. Some obvious
> >> areas that come to mind:
> >>
> >> * SIMD utilities (for hashing or processing of preallocated contiguous
> >> memory)
> >> * Array encoding utilities: RLE / Dictionary, etc.
> >> * Bit manipulation (packing and unpacking, e.g. Daniel Lemire
> >> contributed a patch to parquet-cpp around this)
> >> * Date and time utilities
> >> * Compression utilities
> >>
> >
> > Between Kudu and Impala (at least) there are many more opportunities for
> > sharing. Threads, logging, metrics, concurrent primitives - the list is
> > quite long.
> >
> >
> >>
> >> I hope the benefits are obvious: consolidating efforts on unit
> >> testing, benchmarking, performance optimizations, continuous
> >> integration, and platform compatibility.
> >>
> >> Logistically speaking, one possible avenue might be to use Apache
> >> Arrow as the place to assemble this code. Its thirdparty toolchain is
> >> small, and it builds and installs fast. It is intended as a library to
> >> have its headers used and linked against other applications. (As an
> >> aside, I'm very interested in building optional support for Arrow
> >> columnar messages into the kudu client).
> >>
> >
> > In principle I'm in favour of code sharing, and it seems very much in
> > keeping with the Apache way. However, practically speaking I'm of the
> > opinion that it only makes sense to house shared support code in a
> > separate, dedicated project.
> >
> > Embedding the shared libraries in, e.g., Arrow naturally limits the scope
> > of sharing to utilities that Arrow is interested in. It would make no
> > sense to add a threading library to Arrow if it was never used natively.
> > Muddying the waters of the project's charter seems likely to lead to
> > user, and developer, confusion. Similarly, we should not necessarily
> > couple Arrow's design goals to those it inherits from Kudu and Impala's
> > source code.
> >
> > I think I'd rather see a new Apache project than re-use a current one for
> > two independent purposes.
> >
> >
> >>
> >> The downsides of code sharing, which may have prevented it so far, are
> >> the logistics of coordinating ASF release cycles and keeping build
> >> toolchains in sync. It's taken us the past year to stabilize the
> >> design of Arrow for its intended use cases, so at this point if we
> >> went down this road I would be OK with helping the community commit to
> >> a regular release cadence that would be faster than Impala, Kudu, and
> >> Parquet's respective release cadences. Since members of the Kudu and
> >> Impala PMC are also on the Arrow PMC, I trust we would be able to
> >> collaborate to each other's mutual benefit and success.
> >>
> >> Note that Arrow does not throw C++ exceptions and similarly follows
> >> the Google C++ style guide to the same extent as Kudu and Impala.
> >>
> >> If this is something that either the Kudu or Impala communities would
> >> like to pursue in earnest, I would be happy to work with you on next
> >> steps. I would suggest that we start with something small so that we
> >> could address the necessary build toolchain changes, and develop a
> >> workflow for moving around code and tests, a protocol for code reviews
> >> (e.g. Gerrit), and coordinating ASF releases.
> >>
> >
> > I think, if I'm reading this correctly, that you're assuming integration
> > with the 'downstream' projects (e.g. Impala and Kudu) would be done via
> > their toolchains. For something as fast moving as utility code - and
> > critical, where you want the latency between adding a fix and including
> > it in your build to be ~0 - that's a non-starter to me, at least with
> > how the toolchains are currently realised.
> >
> > I'd rather have the source code directly imported into Impala's tree -
> > whether by git submodule or other mechanism. That way the coupling is
> > looser, and we can move more quickly. I think that's important to other
> > projects as well.
> >
> > Henry
> >
> >
> >
> >>
> >> Let me know what you think.
> >>
> >> best
> >> Wes
> >>
>



-- 
Todd Lipcon
Software Engineer, Cloudera

Re: [DISCUSS] C++ code sharing amongst Apache {Arrow, Kudu, Impala, Parquet}

Posted by Todd Lipcon <to...@cloudera.com>.
Hey folks,

As Henry mentioned, Impala is starting to share more code with Kudu (most
notably our RPC system, but that pulls in a fair bit of utility code as
well), so we've been chatting periodically offline about the best way to do
this. Having more projects potentially interested in collaborating is
definitely welcome, though I think it does also increase the complexity of
whatever solution we come up with.

I think the potential benefits of collaboration are fairly self-evident, so
I'll focus on my concerns here, which somewhat echo Henry's.

1) Open source release model

The ASF is very much against having projects which do not do releases. So,
if we were to create some new ASF project to hold this code, we'd be
expected to do frequent releases thereof. Wes volunteered above to lead
frequent releases, but we actually need at least 3 PMC members to vote on
each release, and given people can come and go, we'd probably need at least
5-8 people who are actively committed to helping with the release process
of this "commons" project.

Unlike our existing projects, which seem to release every 2-3 months, if
that, I think this one would have to release _much_ more frequently, if we
expect downstream projects to depend on released versions rather than just
pulling in some recent (or even trunk) git hash. Since the ASF requires the
normal voting period and process for every release, I don't think we could
do something like have "daily automatic releases", etc.

We could probably campaign the ASF membership to treat this project
differently, either as (a) a repository of code that never releases, in
which case the "downstream" projects are responsible for vetting IP, etc,
as part of their own release processes, or (b) a project which does
automatic releases voted upon by robots. I'm guessing that (a) is more
palatable from an IP perspective, and also from the perspective of the
downstream projects.


2) Governance/review model

The more projects there are sharing this common code, the more difficult it
is to know whether a change would break something, or even whether a change
is considered desirable for all of the projects. I don't want to get into
some world where any change to a central library requires a multi-week
proposal/design-doc/review across 3+ different groups of committers, all of
whom may have different near-term priorities. On the other hand, it would
be pretty frustrating if the week before we're trying to cut a Kudu release
branch, someone in another community decides to make a potentially
destabilizing change to the RPC library.


3) Pre-commit/test mechanics

Semi-related to the above: we currently feel pretty confident when we make
a change to a central library like kudu/util/thread.cc that nothing broke
because we run the full suite of Kudu tests. Of course the central
libraries have some unit test coverage, but I wouldn't be confident with
any sort of model where shared code can change without verification by a
larger suite of tests.

On the other hand, I also don't want to move to a model where any change to
shared code requires a 6+-hour precommit spanning several projects, each of
which may have its own set of potentially-flaky pre-commit tests, etc. I
can imagine that if an Arrow developer made some change to "thread.cc" and
saw that TabletServerStressTest failed their precommit, they'd have no idea
how to triage it, etc. That could be a strong disincentive to continued
innovation in these areas of common code, which we'll need a good way to
avoid.

I think some of the above could be ameliorated with really good
infrastructure -- e.g. on a test failure, automatically re-run the failed
test on both pre-patch and post-patch, do a t-test to check statistical
significance in flakiness level, etc. But, that's a lot of infrastructure
that doesn't currently exist.


4) Integration mechanics for breaking changes

Currently these common libraries are treated as components of monolithic
projects. That means it's no extra overhead for us to make some kind of
change which breaks an API in src/kudu/util/ and at the same time updates
all call sites. The internal libraries have no semblance of API
compatibility guarantees, etc, and adding one is not without cost.

Before sharing code, we should figure out how exactly we'll manage the
cases where we want to make some change in a common library that breaks an
API used by other projects, given there's no way to make an atomic commit
across many repositories. One option is that each "user" of the libraries
manually "rolls" to new versions when they feel like it, but there's still
now a case where a common change "pushes work onto" the consumers to update
call sites, etc.

Admittedly, the number of breaking API changes in these common libraries is
relatively small, but it would still be good to understand how we would plan
to manage them.

-Todd

On Sun, Feb 26, 2017 at 10:12 AM, Wes McKinney <we...@gmail.com> wrote:

> hi Henry,
>
> Thank you for these comments.
>
> I think having a kind of "Apache Commons for [Modern] C++" would be an
> ideal (though perhaps initially more labor intensive) solution.
> There's code in Arrow that I would move into this project if it
> existed. I am happy to help make this happen if there is interest from
> the Kudu and Impala communities. I am not sure logistically what would
> be the most expedient way to establish the project, whether as an ASF
> Incubator project or possibly as a new TLP that could be created by
> spinning IP out of Apache Kudu.
>
> I'm interested to hear the opinions of others, and possible next steps.
>
> Thanks
> Wes
>
> On Sun, Feb 26, 2017 at 2:12 AM, Henry Robinson <he...@apache.org> wrote:
> > Thanks for bringing this up, Wes.
> >
> > On 25 February 2017 at 14:18, Wes McKinney <we...@gmail.com> wrote:
> >
> >> Dear Apache Kudu and Apache Impala (incubating) communities,
> >>
> >> (I'm not sure the best way to have a cross-list discussion, so I
> >> apologize if this does not work well)
> >>
> >> On the recent Apache Parquet sync call, we discussed C++ code sharing
> >> between the codebases in Apache Arrow and Apache Parquet, and
> >> opportunities for more code sharing with Kudu and Impala as well.
> >>
> >> As context
> >>
> >> * We have an RC out for the 1.0.0 release of apache-parquet-cpp, the
> >> first C++ release within Apache Parquet. I got involved with this
> >> project a little over a year ago and was faced with the unpleasant
> >> decision to copy and paste a significant amount of code out of
> >> Impala's codebase to bootstrap the project.
> >>
> >> * In parallel, we began the Apache Arrow project, which is designed to
> >> be a complementary library for file formats (like Parquet), storage
> >> engines (like Kudu), and compute engines (like Impala and pandas).
> >>
> >> * As Arrow and parquet-cpp matured, an increasing amount of code
> >> overlap crept up surrounding buffer memory management and IO
> >> interface. We recently decided in PARQUET-818
> >> (https://github.com/apache/parquet-cpp/commit/2154e873d5aa7280314189a2683fb1e12a590c02)
> >> to remove some of the obvious code overlap in Parquet and make
> >> libarrow.a/so a hard compile and link-time dependency for
> >> libparquet.a/so.
> >>
> >> * There is still quite a bit of code in parquet-cpp that would better
> >> fit in Arrow: SIMD hash utilities, RLE encoding, dictionary encoding,
> >> compression, bit utilities, and so forth. Much of this code originated
> >> from Impala.
> >>
> >> This brings me to the next set of points:
> >>
> >> * parquet-cpp contains quite a bit of code that was extracted from
> >> Impala. This is mostly self-contained in
> >> https://github.com/apache/parquet-cpp/tree/master/src/parquet/util
> >>
> >> * My understanding is that Kudu extracted certain computational
> >> utilities from Impala in its early days, but these tools have likely
> >> diverged as the needs of the projects have evolved.
> >>
> >> Since all of these projects are quite different in their end goals
> >> (runtime systems vs. libraries), touching code that is tightly coupled
> >> to either Kudu or Impala's runtimes is probably not worth discussing.
> >> However, I think there is a strong basis for collaboration on
> >> computational utilities and vectorized array processing. Some obvious
> >> areas that come to mind:
> >>
> >> * SIMD utilities (for hashing or processing of preallocated contiguous
> >> memory)
> >> * Array encoding utilities: RLE / Dictionary, etc.
> >> * Bit manipulation (packing and unpacking, e.g. Daniel Lemire
> >> contributed a patch to parquet-cpp around this)
> >> * Date and time utilities
> >> * Compression utilities
> >>
> >
> > Between Kudu and Impala (at least) there are many more opportunities for
> > sharing. Threads, logging, metrics, concurrent primitives - the list is
> > quite long.
> >
> >
> >>
> >> I hope the benefits are obvious: consolidating efforts on unit
> >> testing, benchmarking, performance optimizations, continuous
> >> integration, and platform compatibility.
> >>
> >> Logistically speaking, one possible avenue might be to use Apache
> >> Arrow as the place to assemble this code. Its thirdparty toolchain is
> >> small, and it builds and installs fast. It is intended as a library to
> >> have its headers used and linked against other applications. (As an
> >> aside, I'm very interested in building optional support for Arrow
> >> columnar messages into the kudu client).
> >>
> >
> > In principle I'm in favour of code sharing, and it seems very much in
> > keeping with the Apache way. However, practically speaking I'm of the
> > opinion that it only makes sense to house shared support code in a
> > separate, dedicated project.
> >
> > Embedding the shared libraries in, e.g., Arrow naturally limits the scope
> > of sharing to utilities that Arrow is interested in. It would make no
> > sense to add a threading library to Arrow if it was never used natively.
> > Muddying the waters of the project's charter seems likely to lead to
> > user, and developer, confusion. Similarly, we should not necessarily
> > couple Arrow's design goals to those it inherits from Kudu and Impala's
> > source code.
> >
> > I think I'd rather see a new Apache project than re-use a current one for
> > two independent purposes.
> >
> >
> >>
> >> The downsides of code sharing, which may have prevented it so far, are
> >> the logistics of coordinating ASF release cycles and keeping build
> >> toolchains in sync. It's taken us the past year to stabilize the
> >> design of Arrow for its intended use cases, so at this point if we
> >> went down this road I would be OK with helping the community commit to
> >> a regular release cadence that would be faster than Impala, Kudu, and
> >> Parquet's respective release cadences. Since members of the Kudu and
> >> Impala PMC are also on the Arrow PMC, I trust we would be able to
> >> collaborate to each other's mutual benefit and success.
> >>
> >> Note that Arrow does not throw C++ exceptions and similarly follows
> >> the Google C++ style guide to the same extent as Kudu and Impala.
> >>
> >> If this is something that either the Kudu or Impala communities would
> >> like to pursue in earnest, I would be happy to work with you on next
> >> steps. I would suggest that we start with something small so that we
> >> could address the necessary build toolchain changes, and develop a
> >> workflow for moving around code and tests, a protocol for code reviews
> >> (e.g. Gerrit), and coordinating ASF releases.
> >>
> >
> > I think, if I'm reading this correctly, that you're assuming integration
> > with the 'downstream' projects (e.g. Impala and Kudu) would be done via
> > their toolchains. For something as fast moving as utility code - and
> > critical, where you want the latency between adding a fix and including
> > it in your build to be ~0 - that's a non-starter to me, at least with
> > how the toolchains are currently realised.
> >
> > I'd rather have the source code directly imported into Impala's tree -
> > whether by git submodule or other mechanism. That way the coupling is
> > looser, and we can move more quickly. I think that's important to other
> > projects as well.
> >
> > Henry
> >
> >
> >
> >>
> >> Let me know what you think.
> >>
> >> best
> >> Wes
> >>
>



-- 
Todd Lipcon
Software Engineer, Cloudera


Re: [DISCUSS] C++ code sharing amongst Apache {Arrow, Kudu, Impala, Parquet}

Posted by Wes McKinney <we...@gmail.com>.
hi Henry,

Thank you for these comments.

I think having a kind of "Apache Commons for [Modern] C++" would be an
ideal (though perhaps initially more labor intensive) solution.
There's code in Arrow that I would move into this project if it
existed. I am happy to help make this happen if there is interest from
the Kudu and Impala communities. I am not sure logistically what would
be the most expedient way to establish the project, whether as an ASF
Incubator project or possibly as a new TLP that could be created by
spinning IP out of Apache Kudu.

I'm interested to hear the opinions of others, and possible next steps.

Thanks
Wes

On Sun, Feb 26, 2017 at 2:12 AM, Henry Robinson <he...@apache.org> wrote:
> Thanks for bringing this up, Wes.
>
> On 25 February 2017 at 14:18, Wes McKinney <we...@gmail.com> wrote:
>
>> Dear Apache Kudu and Apache Impala (incubating) communities,
>>
>> (I'm not sure the best way to have a cross-list discussion, so I
>> apologize if this does not work well)
>>
>> On the recent Apache Parquet sync call, we discussed C++ code sharing
>> between the codebases in Apache Arrow and Apache Parquet, and
>> opportunities for more code sharing with Kudu and Impala as well.
>>
>> As context
>>
>> * We have an RC out for the 1.0.0 release of apache-parquet-cpp, the
>> first C++ release within Apache Parquet. I got involved with this
>> project a little over a year ago and was faced with the unpleasant
>> decision to copy and paste a significant amount of code out of
>> Impala's codebase to bootstrap the project.
>>
>> * In parallel, we began the Apache Arrow project, which is designed to
>> be a complementary library for file formats (like Parquet), storage
>> engines (like Kudu), and compute engines (like Impala and pandas).
>>
>> * As Arrow and parquet-cpp matured, an increasing amount of code
>> overlap crept up surrounding buffer memory management and IO
>> interface. We recently decided in PARQUET-818
>> (https://github.com/apache/parquet-cpp/commit/
>> 2154e873d5aa7280314189a2683fb1e12a590c02)
>> to remove some of the obvious code overlap in Parquet and make
>> libarrow.a/so a hard compile and link-time dependency for
>> libparquet.a/so.
>>
>> * There is still quite a bit of code in parquet-cpp that would better
>> fit in Arrow: SIMD hash utilities, RLE encoding, dictionary encoding,
>> compression, bit utilities, and so forth. Much of this code originated
>> from Impala
>>
>> This brings me to a next set of points:
>>
>> * parquet-cpp contains quite a bit of code that was extracted from
>> Impala. This is mostly self-contained in
>> https://github.com/apache/parquet-cpp/tree/master/src/parquet/util
>>
>> * My understanding is that Kudu extracted certain computational
>> utilities from Impala in its early days, but these tools have likely
>> diverged as the needs of the projects have evolved.
>>
>> Since all of these projects are quite different in their end goals
>> (runtime systems vs. libraries), touching code that is tightly coupled
>> to either Kudu or Impala's runtimes is probably not worth discussing.
>> However, I think there is a strong basis for collaboration on
>> computational utilities and vectorized array processing. Some obvious
>> areas that come to mind:
>>
>> * SIMD utilities (for hashing or processing of preallocated contiguous
>> memory)
>> * Array encoding utilities: RLE / Dictionary, etc.
>> * Bit manipulation (packing and unpacking, e.g. Daniel Lemire
>> contributed a patch to parquet-cpp around this)
>> * Date and time utilities
>> * Compression utilities
>>
>
> Between Kudu and Impala (at least) there are many more opportunities for
> sharing. Threads, logging, metrics, concurrent primitives - the list is
> quite long.
>
>
>>
>> I hope the benefits are obvious: consolidating efforts on unit
>> testing, benchmarking, performance optimizations, continuous
>> integration, and platform compatibility.
>>
>> Logistically speaking, one possible avenue might be to use Apache
>> Arrow as the place to assemble this code. Its thirdparty toolchain is
>> small, and it builds and installs fast. It is intended as a library to
>> have its headers used and linked against other applications. (As an
>> aside, I'm very interested in building optional support for Arrow
>> columnar messages into the Kudu client).
>>
>
> In principle I'm in favour of code sharing, and it seems very much in
> keeping with the Apache way. However, practically speaking I'm of the
> opinion that it only makes sense to house shared support code in a
> separate, dedicated project.
>
> Embedding the shared libraries in, e.g., Arrow naturally limits the scope
> of sharing to utilities that Arrow is interested in. It would make no sense
> to add a threading library to Arrow if it was never used natively. Muddying
> the waters of the project's charter seems likely to lead to user, and
> developer, confusion. Similarly, we should not necessarily couple Arrow's
> design goals to those it inherits from Kudu and Impala's source code.
>
> I think I'd rather see a new Apache project than re-use a current one for
> two independent purposes.
>
>
>>
>> The downsides of code sharing, which may have prevented it so far, are
>> the logistics of coordinating ASF release cycles and keeping build
>> toolchains in sync. It's taken us the past year to stabilize the
>> design of Arrow for its intended use cases, so at this point if we
>> went down this road I would be OK with helping the community commit to
>> a regular release cadence that would be faster than Impala, Kudu, and
>> Parquet's respective release cadences. Since members of the Kudu and
>> Impala PMC are also on the Arrow PMC, I trust we would be able to
>> collaborate to each other's mutual benefit and success.
>>
>> Note that Arrow does not throw C++ exceptions and follows the
>> Google C++ style guide to the same extent as Kudu and Impala.
>>
>> If this is something that either the Kudu or Impala communities would
>> like to pursue in earnest, I would be happy to work with you on next
>> steps. I would suggest that we start with something small so that we
>> could address the necessary build toolchain changes, develop a
>> workflow for moving code and tests around, establish a protocol for
>> code reviews (e.g. Gerrit), and coordinate ASF releases.
>>
>
> I think, if I'm reading this correctly, that you're assuming integration
> with the 'downstream' projects (e.g. Impala and Kudu) would be done via
> their toolchains. For something as fast moving as utility code - and
> critical, where you want the latency between adding a fix and including it
> in your build to be ~0 - that's a non-starter to me, at least with how the
> toolchains are currently realised.
>
> I'd rather have the source code directly imported into Impala's tree -
> whether by git submodule or other mechanism. That way the coupling is
> looser, and we can move more quickly. I think that's important to other
> projects as well.
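[Editor's note: the submodule mechanism Henry mentions could be sketched as follows; the repository URL, submodule path, and pinned commit are hypothetical placeholders, not an actual Impala workflow.]

```shell
# Hypothetical: import a shared C++ utility library into a consumer's
# source tree as a git submodule pinned to a known-good commit.
git submodule add https://github.com/apache/example-cpp-commons \
    thirdparty/cpp-commons
git -C thirdparty/cpp-commons checkout <pinned-sha>
git add .gitmodules thirdparty/cpp-commons
git commit -m "Import cpp-commons utilities as a submodule"

# Picking up an upstream fix is then a one-line bump of the pinned commit:
git submodule update --remote thirdparty/cpp-commons
```

This keeps the coupling loose: the consumer controls exactly when to advance the pinned commit, so the latency between an upstream fix and a local build can approach zero without waiting on a toolchain release.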
>
> Henry
>
>
>
>>
>> Let me know what you think.
>>
>> best
>> Wes
>>

Re: [DISCUSS] C++ code sharing amongst Apache {Arrow, Kudu, Impala, Parquet}

Posted by Henry Robinson <he...@apache.org>.
Thanks for bringing this up, Wes.

On 25 February 2017 at 14:18, Wes McKinney <we...@gmail.com> wrote:

> Dear Apache Kudu and Apache Impala (incubating) communities,
>
> (I'm not sure the best way to have a cross-list discussion, so I
> apologize if this does not work well)
>
> On the recent Apache Parquet sync call, we discussed C++ code sharing
> between the codebases in Apache Arrow and Apache Parquet, and
> opportunities for more code sharing with Kudu and Impala as well.
>
> As context
>
> * We have an RC out for the 1.0.0 release of apache-parquet-cpp, the
> first C++ release within Apache Parquet. I got involved with this
> project a little over a year ago and was faced with the unpleasant
> decision to copy and paste a significant amount of code out of
> Impala's codebase to bootstrap the project.
>
> * In parallel, we began the Apache Arrow project, which is designed to
> be a complementary library for file formats (like Parquet), storage
> engines (like Kudu), and compute engines (like Impala and pandas).
>
> * As Arrow and parquet-cpp matured, an increasing amount of code
> overlap crept up surrounding buffer memory management and IO
> interface. We recently decided in PARQUET-818
> (https://github.com/apache/parquet-cpp/commit/
> 2154e873d5aa7280314189a2683fb1e12a590c02)
> to remove some of the obvious code overlap in Parquet and make
> libarrow.a/so a hard compile and link-time dependency for
> libparquet.a/so.
>
> * There is still quite a bit of code in parquet-cpp that would better
> fit in Arrow: SIMD hash utilities, RLE encoding, dictionary encoding,
> compression, bit utilities, and so forth. Much of this code originated
> from Impala
>
> This brings me to a next set of points:
>
> * parquet-cpp contains quite a bit of code that was extracted from
> Impala. This is mostly self-contained in
> https://github.com/apache/parquet-cpp/tree/master/src/parquet/util
>
> * My understanding is that Kudu extracted certain computational
> utilities from Impala in its early days, but these tools have likely
> diverged as the needs of the projects have evolved.
>
> Since all of these projects are quite different in their end goals
> (runtime systems vs. libraries), touching code that is tightly coupled
> to either Kudu or Impala's runtimes is probably not worth discussing.
> However, I think there is a strong basis for collaboration on
> computational utilities and vectorized array processing. Some obvious
> areas that come to mind:
>
> * SIMD utilities (for hashing or processing of preallocated contiguous
> memory)
> * Array encoding utilities: RLE / Dictionary, etc.
> * Bit manipulation (packing and unpacking, e.g. Daniel Lemire
> contributed a patch to parquet-cpp around this)
> * Date and time utilities
> * Compression utilities
>

Between Kudu and Impala (at least) there are many more opportunities for
sharing. Threads, logging, metrics, concurrent primitives - the list is
quite long.


>
> I hope the benefits are obvious: consolidating efforts on unit
> testing, benchmarking, performance optimizations, continuous
> integration, and platform compatibility.
>
> Logistically speaking, one possible avenue might be to use Apache
> Arrow as the place to assemble this code. Its thirdparty toolchain is
> small, and it builds and installs fast. It is intended as a library to
> have its headers used and linked against other applications. (As an
> aside, I'm very interested in building optional support for Arrow
> columnar messages into the Kudu client).
>

In principle I'm in favour of code sharing, and it seems very much in
keeping with the Apache way. However, practically speaking I'm of the
opinion that it only makes sense to house shared support code in a
separate, dedicated project.

Embedding the shared libraries in, e.g., Arrow naturally limits the scope
of sharing to utilities that Arrow is interested in. It would make no sense
to add a threading library to Arrow if it was never used natively. Muddying
the waters of the project's charter seems likely to lead to user, and
developer, confusion. Similarly, we should not necessarily couple Arrow's
design goals to those it inherits from Kudu and Impala's source code.

I think I'd rather see a new Apache project than re-use a current one for
two independent purposes.


>
> The downsides of code sharing, which may have prevented it so far, are
> the logistics of coordinating ASF release cycles and keeping build
> toolchains in sync. It's taken us the past year to stabilize the
> design of Arrow for its intended use cases, so at this point if we
> went down this road I would be OK with helping the community commit to
> a regular release cadence that would be faster than Impala, Kudu, and
> Parquet's respective release cadences. Since members of the Kudu and
> Impala PMC are also on the Arrow PMC, I trust we would be able to
> collaborate to each other's mutual benefit and success.
>
> Note that Arrow does not throw C++ exceptions and follows the
> Google C++ style guide to the same extent as Kudu and Impala.
>
> If this is something that either the Kudu or Impala communities would
> like to pursue in earnest, I would be happy to work with you on next
> steps. I would suggest that we start with something small so that we
> could address the necessary build toolchain changes, and develop a
> workflow for moving around code and tests, a protocol for code reviews
> (e.g. Gerrit), and coordinating ASF releases.
>

I think, if I'm reading this correctly, that you're assuming integration
with the 'downstream' projects (e.g. Impala and Kudu) would be done via
their toolchains. For something as fast-moving and critical as utility
code - where you want the latency between adding a fix and including it
in your build to be ~0 - that's a non-starter to me, at least with how
the toolchains are currently realised.

I'd rather have the source code directly imported into Impala's tree -
whether by git submodule or other mechanism. That way the coupling is
looser, and we can move more quickly. I think that's important to other
projects as well.
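The submodule approach can be sketched as below; the `common-cpp` repository name and the `thirdparty/` path are hypothetical, and the local repositories here only stand in for real remotes:

```shell
# Hypothetical sketch: vendoring a shared C++ utility repo into a
# project's tree as a git submodule, so a fix lands in the build as soon
# as the pinned commit is advanced.
set -e
tmp="$(mktemp -d)"
cd "$tmp"

# Stand-in for the shared utilities repository ("common-cpp" is a
# placeholder, not a real ASF repo).
git init -q common-cpp
git -C common-cpp -c user.email=dev@example.org -c user.name=dev \
    commit -q --allow-empty -m "initial commit"

# Stand-in for the consuming project's tree (e.g. Impala).
git init -q impala
cd impala
git -c user.email=dev@example.org -c user.name=dev \
    commit -q --allow-empty -m "initial commit"

# Newer git requires opting in to file:// submodule URLs.
git -c protocol.file.allow=always \
    submodule add -q "$tmp/common-cpp" thirdparty/common-cpp

# .gitmodules records the path and URL; the superproject pins the exact
# commit, so picking up fixes is an explicit `git submodule update --remote`.
cat .gitmodules
```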

Henry



>
> Let me know what you think.
>
> best
> Wes
>
