Posted to dev@impala.apache.org by Henry Robinson <he...@apache.org> on 2016/03/10 19:38:24 UTC

Getting rid of thirdparty

One of the tasks remaining before we can push Impala's code to the ASF's
git instance is to reduce the size of the repository. Right now even a
checkout of origin/cdh5-trunk is in the multi-GB range.

The vast majority of that is in the thirdparty/ directory, which adds up
over the git history to be pretty huge with all the various versions we've
checked in. Removing it shrinks cdh5-trunk to ~200MB. So I propose we get
rid of thirdparty/ altogether.

There are two main dependency types in thirdparty/. The first is
compile-time C++ dependencies like open-ldap or avro-c. These are (almost)
all superseded by the toolchain build (see
https://github.com/cloudera/native-toolchain). A couple of exceptions
are Squeasel and Mustache, which don't produce their own libraries but are
source files directly included in the Impala build. I don't see a good
reason we couldn't move those to the toolchain as well.

The other kind of dependency is the test binaries that are used when we
start Impala's test environment (i.e. the Hive metastore, HBase, and so
on). These are trickier to extract (they're not just JARs, but bin/hadoop
and similar scripts). We also need to be able to change these dependencies
efficiently - the upstream ASF repo should use ASF-released artifacts here,
but downstream vendors (like Cloudera) will want to replace the ASF
artifacts with their own releases.

Note that the Java binaries in thirdparty/ are *not* the compile-time
dependencies for Impala's Java frontend - those are resolved via Maven.
It's a bad thing that there are two dependency resolution mechanisms, but
we might not be able to solve that issue right now.

So what should we do with the test dependencies? I see the following
options:

1. Put them in the native-toolchain repository. *Pros:* (almost) all
dependency resolution comes from one place. *Cons:* native-toolchain would
change very frequently as new releases happen.

2. Don't provide any built-in mechanism for starting a test environment. If
you want to test Impala, set up your own Hadoop cluster instance.
*Pros:* removes a lot of complexity. *Cons:* pushes a lot of work onto the
user, and makes it harder to run self-contained tests.

3. Have a separate test-dependencies repository that does basically the
same thing as the toolchain. *Pros:* separates out fast-moving dependencies
from slow-moving ones *Cons:* more moving parts. HDFS would need to be in
both repositories (as libhdfs is a compile-time dependency for the backend).

My preference is for option #1. We can do something like the following:

* Add a CMake target to 'build' a test environment (resolve test
dependencies, start the mini-cluster using checked-in scripts)
* Add scripts to native-toolchain to download tarballs for HBase, HDFS,
Hive and others, just like compile-time dependencies. Update Impala's CMake
scripts to use the local toolchain directory to find binaries,
management scripts, etc.
* During each upstream release, add any new dependencies to
native-toolchain, and update impala.git/bin/impala-config.sh with the new
version numbers.
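A rough sketch of what that resolution step could look like. The mirror
URL, version pins, and file layout below are hypothetical placeholders for
illustration, not the actual Impala configuration (which the proposal says
would live in impala.git/bin/impala-config.sh):

```python
import os
import tarfile
import urllib.request

# Hypothetical mirror and version pins; in the real proposal these would
# come from impala.git/bin/impala-config.sh.
BASE_URL = "https://example-mirror.invalid/test-deps"
VERSIONS = {"hadoop": "2.6.0", "hbase": "1.1.1", "hive": "1.1.0"}

def tarball_name(pkg, version):
    # Artifact naming convention for one test dependency,
    # e.g. hadoop-2.6.0.tar.gz.
    return f"{pkg}-{version}.tar.gz"

def resolve_test_deps(dest_dir):
    """Fetch and unpack each pinned test dependency into dest_dir."""
    os.makedirs(dest_dir, exist_ok=True)
    for pkg, version in sorted(VERSIONS.items()):
        name = tarball_name(pkg, version)
        path = os.path.join(dest_dir, name)
        if not os.path.exists(path):  # skip artifacts already downloaded
            urllib.request.urlretrieve(f"{BASE_URL}/{name}", path)
        with tarfile.open(path) as tar:
            tar.extractall(dest_dir)
```

The CMake target from the first bullet would then just invoke a script like
this before running the checked-in mini-cluster start scripts.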

What does everyone think?

Re: Getting rid of thirdparty

Posted by Casey Ching <ca...@cloudera.com>.
I think the only other choice is a released version. We’d have to get someone to set up weekly snapshots for all our stuff.

On March 11, 2016 at 11:23:25 AM, Skye Wanderman-Milne (skye@cloudera.com) wrote:

For Maven, can we not specify a different version than SNAPSHOT to get more  
stable dependencies?  

On Thu, Mar 10, 2016 at 1:24 PM, Casey Ching <ca...@cloudera.com> wrote:  

> I looked into running the test services directly through maven and it does  
> work but after thinking about it more, we’d no longer be able to control  
> when to upgrade java third party. Basically we’d upgrade every night. That  
> may actually be the best approach for apache impala but I don’t think we’d  
> like that at Cloudera.  
>  
> On March 10, 2016 at 11:55:09 AM, Tim Armstrong (tarmstrong@cloudera.com)  
> wrote:  
>  
> My previous response was missing some context. There's  
> bin/bootstrap_toolchain.py in the Impala repo that downloads prebuilt  
> dependencies of the right versions from S3. I modifying this script or  
> creating a similar script to download pre-built test dependencies is a good  
> idea.  
>  
> There is a different aspect to the native toolchain, the build scripts in  
> native-toolchain that bootstrap Impala's native dependencies starting from  
> gcc. The output artifacts of this process are uploaded to S3. Other  
> dependencies (hadoop, etc) are built in a different way so I think the  
> native-toolchain repo doesn't need to know about them. libhdfs is maybe a  
> corner case where it would be good to add it to the toolchain if possible  
> to make the build more reproducible.  
>  
> On Thu, Mar 10, 2016 at 11:24 AM, Daniel Hecht <dh...@cloudera.com>  
> wrote:  
>  
> > On Thu, Mar 10, 2016 at 11:10 AM, Henry Robinson <he...@cloudera.com>  
> > wrote:  
> > > I didn't think that binaries were uploaded to any repository, but  
> instead  
> > > to S3 (and therefore there's no version history) or some other URL.  
> > That's  
> > > what I'd suggest we continue to do.  
> > >  
> >  
> > A bit of a tangent (but important if we will rely even more on  
> > toolchain: the fact that the binaries (and clean source) are only  
> > copied to S3 seems like a problem. What happens if someone  
> > accidentally 'rm -rf' the toolchain bucket? Can we reproduce our old  
> > build exactly? Are we at least backing up the S3 toolchain bucket  
> > somehow?  
> >  
>  

Re: Getting rid of thirdparty

Posted by Skye Wanderman-Milne <sk...@cloudera.com>.
For Maven, can we not specify a different version than SNAPSHOT to get more
stable dependencies?

On Thu, Mar 10, 2016 at 1:24 PM, Casey Ching <ca...@cloudera.com> wrote:

> I looked into running the test services directly through maven and it does
> work but after thinking about it more, we’d no longer be able to control
> when to upgrade java third party. Basically we’d upgrade every night. That
> may actually be the best approach for apache impala but I don’t think we’d
> like that at Cloudera.
>
> On March 10, 2016 at 11:55:09 AM, Tim Armstrong (tarmstrong@cloudera.com)
> wrote:
>
> My previous response was missing some context. There's
> bin/bootstrap_toolchain.py in the Impala repo that downloads prebuilt
> dependencies of the right versions from S3. I modifying this script or
> creating a similar script to download pre-built test dependencies is a good
> idea.
>
> There is a different aspect to the native toolchain, the build scripts in
> native-toolchain that bootstrap Impala's native dependencies starting from
> gcc. The output artifacts of this process are uploaded to S3. Other
> dependencies (hadoop, etc) are built in a different way so I think the
> native-toolchain repo doesn't need to know about them. libhdfs is maybe a
> corner case where it would be good to add it to the toolchain if possible
> to make the build more reproducible.
>
> On Thu, Mar 10, 2016 at 11:24 AM, Daniel Hecht <dh...@cloudera.com>
> wrote:
>
> > On Thu, Mar 10, 2016 at 11:10 AM, Henry Robinson <he...@cloudera.com>
> > wrote:
> > > I didn't think that binaries were uploaded to any repository, but
> instead
> > > to S3 (and therefore there's no version history) or some other URL.
> > That's
> > > what I'd suggest we continue to do.
> > >
> >
> > A bit of a tangent (but important if we will rely even more on
> > toolchain: the fact that the binaries (and clean source) are only
> > copied to S3 seems like a problem. What happens if someone
> > accidentally 'rm -rf' the toolchain bucket? Can we reproduce our old
> > build exactly? Are we at least backing up the S3 toolchain bucket
> > somehow?
> >
>

Re: Getting rid of thirdparty

Posted by Casey Ching <ca...@cloudera.com>.
I looked into running the test services directly through Maven and it does work, but after thinking about it more, we’d no longer be able to control when to upgrade Java third-party dependencies. Basically we’d upgrade every night. That may actually be the best approach for Apache Impala but I don’t think we’d like that at Cloudera.

On March 10, 2016 at 11:55:09 AM, Tim Armstrong (tarmstrong@cloudera.com) wrote:

My previous response was missing some context. There's  
bin/bootstrap_toolchain.py in the Impala repo that downloads prebuilt  
dependencies of the right versions from S3. I modifying this script or  
creating a similar script to download pre-built test dependencies is a good  
idea.  

There is a different aspect to the native toolchain, the build scripts in  
native-toolchain that bootstrap Impala's native dependencies starting from  
gcc. The output artifacts of this process are uploaded to S3. Other  
dependencies (hadoop, etc) are built in a different way so I think the  
native-toolchain repo doesn't need to know about them. libhdfs is maybe a  
corner case where it would be good to add it to the toolchain if possible  
to make the build more reproducible.  

On Thu, Mar 10, 2016 at 11:24 AM, Daniel Hecht <dh...@cloudera.com> wrote:  

> On Thu, Mar 10, 2016 at 11:10 AM, Henry Robinson <he...@cloudera.com>  
> wrote:  
> > I didn't think that binaries were uploaded to any repository, but instead  
> > to S3 (and therefore there's no version history) or some other URL.  
> That's  
> > what I'd suggest we continue to do.  
> >  
>  
> A bit of a tangent (but important if we will rely even more on  
> toolchain: the fact that the binaries (and clean source) are only  
> copied to S3 seems like a problem. What happens if someone  
> accidentally 'rm -rf' the toolchain bucket? Can we reproduce our old  
> build exactly? Are we at least backing up the S3 toolchain bucket  
> somehow?  
>  

Re: Getting rid of thirdparty

Posted by Tim Armstrong <ta...@cloudera.com>.
My previous response was missing some context. There's
bin/bootstrap_toolchain.py in the Impala repo that downloads prebuilt
dependencies of the right versions from S3. I think modifying this script,
or creating a similar script to download pre-built test dependencies, would
be a good idea.

There is a different aspect to the native toolchain: the build scripts in
native-toolchain that bootstrap Impala's native dependencies starting from
gcc. The output artifacts of this process are uploaded to S3. Other
dependencies (hadoop, etc.) are built in a different way, so I think the
native-toolchain repo doesn't need to know about them. libhdfs may be a
corner case where it would be good to add it to the toolchain, if possible,
to make the build more reproducible.
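To make the first paragraph concrete, the core of such a bootstrap script
is a mapping from (package, version) pins to download URLs. The bucket
layout and compiler tag below are assumptions for illustration, not the
real bootstrap_toolchain.py scheme:

```python
# Hypothetical S3 layout: one directory per package-version, with artifacts
# tagged by the compiler that produced them.
S3_BASE = "https://example-bucket.invalid/build"

def artifact_url(package, version, compiler="gcc-4.9.2"):
    # e.g. .../build/gflags-2.0/gflags-2.0-gcc-4.9.2.tar.gz
    name = f"{package}-{version}-{compiler}.tar.gz"
    return "/".join([S3_BASE, f"{package}-{version}", name])
```

A test-dependency variant would use the same shape of mapping, just without
the compiler tag, since those artifacts are prebuilt releases rather than
toolchain output.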

On Thu, Mar 10, 2016 at 11:24 AM, Daniel Hecht <dh...@cloudera.com> wrote:

> On Thu, Mar 10, 2016 at 11:10 AM, Henry Robinson <he...@cloudera.com>
> wrote:
> > I didn't think that binaries were uploaded to any repository, but instead
> > to S3 (and therefore there's no version history) or some other URL.
> That's
> > what I'd suggest we continue to do.
> >
>
> A bit of a tangent (but important if we will rely even more on
> toolchain: the fact that the binaries (and clean source) are only
> copied to S3 seems like a problem.  What happens if someone
> accidentally 'rm -rf' the toolchain bucket?  Can we reproduce our old
> build exactly?  Are we at least backing up the S3 toolchain bucket
> somehow?
>

Re: Getting rid of thirdparty

Posted by Henry Robinson <he...@cloudera.com>.
On 10 March 2016 at 11:24, Daniel Hecht <dh...@cloudera.com> wrote:

> On Thu, Mar 10, 2016 at 11:10 AM, Henry Robinson <he...@cloudera.com>
> wrote:
> > I didn't think that binaries were uploaded to any repository, but instead
> > to S3 (and therefore there's no version history) or some other URL.
> That's
> > what I'd suggest we continue to do.
> >
>
> A bit of a tangent (but important if we will rely even more on
> toolchain: the fact that the binaries (and clean source) are only
> copied to S3 seems like a problem.  What happens if someone
> accidentally 'rm -rf' the toolchain bucket?  Can we reproduce our old
> build exactly?  Are we at least backing up the S3 toolchain bucket
> somehow?
>

(That is a tangent (but an important one); let's take that up on a
different internal thread as that's really a Cloudera concern, not an
Apache Impala one.)

Chatted with Tim and Dan a bit about this - I'd forgotten that
bootstrap_toolchain.py now lives in Impala, so there's not really a need to
involve the native-toolchain repo in this. We can just update
bootstrap_toolchain.py to deal with non-native-toolchain dependencies.

Re: Getting rid of thirdparty

Posted by Daniel Hecht <dh...@cloudera.com>.
On Thu, Mar 10, 2016 at 11:10 AM, Henry Robinson <he...@cloudera.com> wrote:
> I didn't think that binaries were uploaded to any repository, but instead
> to S3 (and therefore there's no version history) or some other URL. That's
> what I'd suggest we continue to do.
>

A bit of a tangent (but important if we will rely even more on the
toolchain): the fact that the binaries (and clean source) are only
copied to S3 seems like a problem.  What happens if someone
accidentally runs 'rm -rf' on the toolchain bucket?  Can we reproduce
our old build exactly?  Are we at least backing up the S3 toolchain
bucket somehow?

Re: Getting rid of thirdparty

Posted by Henry Robinson <he...@cloudera.com>.
I didn't think that binaries were uploaded to any repository, but instead
to S3 (and therefore there's no version history) or some other URL. That's
what I'd suggest we continue to do.

Cloudera and the Apache Impala project should do what's best for them,
independently. I bet Cloudera can fork the native-toolchain repository and
set the dependency versions as desired. Then the dependencies can be
uploaded to a Cloudera-specific location.

Maven would also be an OK route to explore - do other projects routinely
check start/stop scripts and the like into Maven? The nice thing about the
toolchain is that we can usually rely on a longer lifetime for published
artifacts (in my experience, dependencies can come and go with Maven).

On 10 March 2016 at 11:03, Casey Ching <ca...@cloudera.com> wrote:

> I suspect we can actually run all the test services using the maven
> artifacts. Maybe we can investigate that?
>
> There’s not enough information about #1. How do updates work? The nice
> thing about the current setup is anyone can checkout any commit and there’s
> a decent chance that checkout will build. Are we going to keep that
> ability? How does this work for Cloudera, Apache, and others, are we going
> to upload all test binaries to the same repo?
>
>
> On March 10, 2016 at 10:52:07 AM, Jim Apple (jbapple@cloudera.com) wrote:
> Both #1 and #3 seem reasonable to me. I think #2 should be avoided because
> the Con you listed will, I think, make contributing to Impala difficult
> for
> new contributors, and I think that's more serious than the Cons for #1 and
> #3.
>
> On Thu, Mar 10, 2016 at 10:38 AM, Henry Robinson <he...@apache.org>
> wrote:
>
> > One of the tasks remaining before we can push Impala's code to the ASF's
> > git instance is to reduce the size of the repository. Right now even a
> > checkout of origin/cdh5-trunk is in the multi-GB range.
> >
> > The vast majority of that is in the thirdparty/ directory, which adds up
> > over the git history to be pretty huge with all the various versions
> we've
> > checked in. Removing it shrinks cdh5-trunk to ~200MB. So I propose we
> get
> > rid of thirdparty/ altogether.
> >
> > There are two main dependency types in thirdparty/. The first is a
> > compile-time C++ dependency like open-ldap or avro-c. These are (almost)
> > all superseded by the toolchain (see
> > https://github.com/cloudera/native-toolchain) build. A couple of
> > exceptions
> > are Squeasel and Mustache which don't produce their own libraries but
> are
> > source files directly included in the Impala build. I don't see a good
> > reason we couldn't move those to the toolchain as well.
> >
> > The other kind of dependency are the test binaries that are used when we
> > start Impala's test environment (i.e. start the Hive metastore, HBase,
> etc,
> > etc.). These are trickier to extract (they're not just JARs, but
> bin/hadoop
> > etc. etc.). We also need to be able to change these dependencies pretty
> > efficiently - the upstream ASF repo should use ASF-released artifacts
> here,
> > but downstream vendors (like Cloudera) will want to replace the ASF
> > artifacts with their own releases.
> >
> > Note that the Java binaries in thirdparty/ are *not* the compile-time
> > dependencies for Impala's Java frontend - those are resolved via Maven.
> > It's a bad thing that there's two dependency resolution mechanisms, but
> we
> > might not be able to solve that issue right now.
>
> >
> > So what should we do with the test dependencies? I see the following
> > options:
> >
> > 1. Put them in the native-toolchain repository. *Pros:* (almost) all
> > dependency resolution comes from one place. *Cons:* native-toolchain
> would
> > change very frequently as new releases happen.
> >
> > 2. Don't provide any built-in mechanism for starting a test environment.
> If
> > you want to test Impala - set up your own Hadoop cluster instance.
> > *Pros:* removes
> > a lot of complexity *Cons: *pushes a lot of work onto the user, makes it
> > harder to run self-contained tests.
> >
> > 3. Have a separate test-dependencies repository that does basically the
> > same thing as the toolchain. *Pros:* separates out fast-moving
> dependencies
> > from slow-moving ones *Cons:* more moving parts. HDFS would need to be
> in
> > both repositories (as libhdfs is a compile-time dependency for the
> > backend).
> >
> > My preference is for option #1. We can do something like the following:
> >
> > * Add a cmake target to 'build' a test environment (resolve test
> > dependencies, start mini-cluster using checked-in scripts)
> > * Add scripts to native-toolchain to download tarballs for HBase, HDFS,
> > Hive and others just like compile-time dependencies. Update Impala's
> CMake
> > scripts to use those the local toolchain directory to find binaries,
> > management scripts etc.
> > * During each upstream release, add any new dependencies to
> > native-toolchain, and update impala.git/bin/impala-config.sh with the
> new
> > version numbers.
> >
> > What does everyone think?
> >
>



-- 
Henry Robinson
Software Engineer
Cloudera
415-994-6679

Re: Getting rid of thirdparty

Posted by Casey Ching <ca...@cloudera.com>.
I suspect we can actually run all the test services using the Maven artifacts. Maybe we can investigate that?

There’s not enough information about #1. How do updates work? The nice thing about the current setup is anyone can check out any commit and there’s a decent chance that checkout will build. Are we going to keep that ability? How does this work for Cloudera, Apache, and others? Are we going to upload all test binaries to the same repo?


On March 10, 2016 at 10:52:07 AM, Jim Apple (jbapple@cloudera.com) wrote:
Both #1 and #3 seem reasonable to me. I think #2 should be avoided because 
the Con you listed will, I think, make contributing to Impala difficult for 
new contributors, and I think that's more serious than the Cons for #1 and 
#3. 

On Thu, Mar 10, 2016 at 10:38 AM, Henry Robinson <he...@apache.org> wrote: 

> One of the tasks remaining before we can push Impala's code to the ASF's 
> git instance is to reduce the size of the repository. Right now even a 
> checkout of origin/cdh5-trunk is in the multi-GB range. 
> 
> The vast majority of that is in the thirdparty/ directory, which adds up 
> over the git history to be pretty huge with all the various versions we've 
> checked in. Removing it shrinks cdh5-trunk to ~200MB. So I propose we get 
> rid of thirdparty/ altogether. 
> 
> There are two main dependency types in thirdparty/. The first is a 
> compile-time C++ dependency like open-ldap or avro-c. These are (almost) 
> all superseded by the toolchain (see 
> https://github.com/cloudera/native-toolchain) build. A couple of 
> exceptions 
> are Squeasel and Mustache which don't produce their own libraries but are 
> source files directly included in the Impala build. I don't see a good 
> reason we couldn't move those to the toolchain as well. 
> 
> The other kind of dependency are the test binaries that are used when we 
> start Impala's test environment (i.e. start the Hive metastore, HBase, etc, 
> etc.). These are trickier to extract (they're not just JARs, but bin/hadoop 
> etc. etc.). We also need to be able to change these dependencies pretty 
> efficiently - the upstream ASF repo should use ASF-released artifacts here, 
> but downstream vendors (like Cloudera) will want to replace the ASF 
> artifacts with their own releases. 
> 
> Note that the Java binaries in thirdparty/ are *not* the compile-time 
> dependencies for Impala's Java frontend - those are resolved via Maven. 
> It's a bad thing that there's two dependency resolution mechanisms, but we 
> might not be able to solve that issue right now. 

> 
> So what should we do with the test dependencies? I see the following 
> options: 
> 
> 1. Put them in the native-toolchain repository. *Pros:* (almost) all 
> dependency resolution comes from one place. *Cons:* native-toolchain would 
> change very frequently as new releases happen. 
> 
> 2. Don't provide any built-in mechanism for starting a test environment. If 
> you want to test Impala - set up your own Hadoop cluster instance. 
> *Pros:* removes 
> a lot of complexity *Cons: *pushes a lot of work onto the user, makes it 
> harder to run self-contained tests. 
> 
> 3. Have a separate test-dependencies repository that does basically the 
> same thing as the toolchain. *Pros:* separates out fast-moving dependencies 
> from slow-moving ones *Cons:* more moving parts. HDFS would need to be in 
> both repositories (as libhdfs is a compile-time dependency for the 
> backend). 
> 
> My preference is for option #1. We can do something like the following: 
> 
> * Add a cmake target to 'build' a test environment (resolve test 
> dependencies, start mini-cluster using checked-in scripts) 
> * Add scripts to native-toolchain to download tarballs for HBase, HDFS, 
> Hive and others just like compile-time dependencies. Update Impala's CMake 
> scripts to use those the local toolchain directory to find binaries, 
> management scripts etc. 
> * During each upstream release, add any new dependencies to 
> native-toolchain, and update impala.git/bin/impala-config.sh with the new 
> version numbers. 
> 
> What does everyone think? 
> 

Re: Getting rid of thirdparty

Posted by Jim Apple <jb...@cloudera.com>.
Both #1 and #3 seem reasonable to me. I think #2 should be avoided because
the Con you listed will, I think, make contributing to Impala difficult for
new contributors, and I think that's more serious than the Cons for #1 and
#3.

On Thu, Mar 10, 2016 at 10:38 AM, Henry Robinson <he...@apache.org> wrote:

> One of the tasks remaining before we can push Impala's code to the ASF's
> git instance is to reduce the size of the repository. Right now even a
> checkout of origin/cdh5-trunk is in the multi-GB range.
>
> The vast majority of that is in the thirdparty/ directory, which adds up
> over the git history to be pretty huge with all the various versions we've
> checked in. Removing it shrinks cdh5-trunk to ~200MB. So I propose we get
> rid of thirdparty/ altogether.
>
> There are two main dependency types in thirdparty/. The first is a
> compile-time C++ dependency like open-ldap or avro-c. These are (almost)
> all superseded by the toolchain (see
> https://github.com/cloudera/native-toolchain) build. A couple of
> exceptions
> are Squeasel and Mustache which don't produce their own libraries but are
> source files directly included in the Impala build. I don't see a good
> reason we couldn't move those to the toolchain as well.
>
> The other kind of dependency are the test binaries that are used when we
> start Impala's test environment (i.e. start the Hive metastore, HBase, etc,
> etc.). These are trickier to extract (they're not just JARs, but bin/hadoop
> etc. etc.). We also need to be able to change these dependencies pretty
> efficiently - the upstream ASF repo should use ASF-released artifacts here,
> but downstream vendors (like Cloudera) will want to replace the ASF
> artifacts with their own releases.
>
> Note that the Java binaries in thirdparty/ are *not* the compile-time
> dependencies for Impala's Java frontend - those are resolved via Maven.
> It's a bad thing that there's two dependency resolution mechanisms, but we
> might not be able to solve that issue right now.
>
> So what should we do with the test dependencies? I see the following
> options:
>
> 1. Put them in the native-toolchain repository. *Pros:* (almost) all
> dependency resolution comes from one place. *Cons:* native-toolchain would
> change very frequently as new releases happen.
>
> 2. Don't provide any built-in mechanism for starting a test environment. If
> you want to test Impala - set up your own Hadoop cluster instance.
> *Pros:* removes
> a lot of complexity *Cons: *pushes a lot of work onto the user, makes it
> harder to run self-contained tests.
>
> 3. Have a separate test-dependencies repository that does basically the
> same thing as the toolchain. *Pros:* separates out fast-moving dependencies
> from slow-moving ones *Cons:* more moving parts. HDFS would need to be in
> both repositories (as libhdfs is a compile-time dependency for the
> backend).
>
> My preference is for option #1. We can do something like the following:
>
> * Add a cmake target to 'build' a test environment (resolve test
> dependencies, start mini-cluster using checked-in scripts)
> * Add scripts to native-toolchain to download tarballs for HBase, HDFS,
> Hive and others just like compile-time dependencies. Update Impala's CMake
> scripts to use those the local toolchain directory to find binaries,
> management scripts etc.
> * During each upstream release, add any new dependencies to
> native-toolchain, and update impala.git/bin/impala-config.sh with the new
> version numbers.
>
> What does everyone think?
>

Re: Getting rid of thirdparty

Posted by Tim Armstrong <ta...@cloudera.com>.
So my feeling on what we should be working towards:

   - Support for reproducible builds using the toolchain.
   - Support for building against the system versions of all dependencies
   (subject to whatever constraints about versions we agree on)
   - A straightforward way to set up a working test environment

I think native-toolchain is probably the way to go, but I suspect we'll
need to make some changes to it at some point:

   - Currently native-toolchain is designed to build every historical
   version of every package every time. At some point this will stop scaling
   as we add more packages and more versions.
   - We probably eventually need a way for users of native-toolchain to get
   their source packages from somewhere other than a Cloudera-managed S3
   bucket

I don't think we should use native-toolchain as a catch-all for all
dependencies. I think it's reasonable to add C++ libraries that we want to
be part of the reproducible native build, but I don't think it makes sense
to use the toolchain to download precompiled dependencies that won't be
part of the reproducible build. I.e. if buildall.sh doesn't build the
library from source using the toolchain's compiler, I don't think it should
be in the native toolchain.
I think libhdfs is a bit of a corner case, in that it is native code, part
of the Hadoop distribution, that we link into Impala. We could move it to
the toolchain if we want to build it as a standalone library, but I'm not
sure that necessarily makes sense.

- Tim

On Thu, Mar 10, 2016 at 11:03 AM, Henry Robinson <he...@cloudera.com> wrote:

> " the upstream ASF repo should use ASF-released artifacts here"
>
> While there's precedent elsewhere in the ASF for depending on downstream
> vendor-specific artifacts, I feel pretty strongly that there should be a
> clean separation between the ASF and downstream dependencies.
>
> I take your point about the flexibility of choosing which toolchain
> dependencies to take. Might be a good follow-on step to allow that
> (TOOLCHAIN_MODE={ALL, COMPILE, TEST}) or something, but we can wait to see
> if this is needed by the community.
>
> On 10 March 2016 at 10:59, Matthew Jacobs <mj...@cloudera.com> wrote:
>
> > Thanks for outlining these options. How does native-toolchain factor into
> > our ASF story? I.e. do we need it to be less Cloudera-project-oriented,
> or
> > is it OK for it to contain CDH (rather than Apache Hadoop) deployments?
> If
> > we're considering it to be more Cloudera-focused, it seems like it could
> > make upstream contributions difficult as there wouldn't really be a
> non-CDH
> > build/runtime toolchain. I guess upstream contributors could fork our
> > toolchain (or start their own) and replace the CDH components? If we
> detach
> > the compile-time dependencies and the test runtime projects, it would
> > probably make things easier for the rest of the world as they could
> easily
> > take the native-toolchain, the test environment, or both.
> >
> > On Thu, Mar 10, 2016 at 10:38 AM Henry Robinson <he...@apache.org>
> wrote:
> >
> > > One of the tasks remaining before we can push Impala's code to the
> ASF's
> > > git instance is to reduce the size of the repository. Right now even a
> > > checkout of origin/cdh5-trunk is in the multi-GB range.
> > >
> > > The vast majority of that is in the thirdparty/ directory, which adds
> up
> > > over the git history to be pretty huge with all the various versions
> > we've

Re: Getting rid of thirdparty

Posted by Henry Robinson <he...@cloudera.com>.
" the upstream ASF repo should use ASF-released artifacts here"

While there's precedent elsewhere in the ASF for depending on downstream
vendor-specific artifacts, I feel pretty strongly that there should be a
clean separation between the ASF and downstream dependencies.

I take your point about the flexibility of choosing which toolchain
dependencies to take. It might be a good follow-on step to allow that
(something like TOOLCHAIN_MODE={ALL, COMPILE, TEST}), but we can wait to
see whether the community needs it.
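A mode flag like that could be a small shim in the toolchain bootstrap. A minimal sketch of the idea, assuming a hypothetical resolve_deps helper and illustrative dependency lists (none of this exists in impala-config.sh or the toolchain today):

```shell
#!/bin/bash
# Hypothetical sketch: pick which dependency groups the toolchain
# bootstrap fetches, driven by TOOLCHAIN_MODE={ALL, COMPILE, TEST}.
# The dependency lists below are illustrative, not the real toolchain set.

COMPILE_DEPS="openldap avro-c squeasel mustache"
TEST_DEPS="hadoop hbase hive"

resolve_deps() {
  case "${1:-ALL}" in
    ALL)     echo "$COMPILE_DEPS $TEST_DEPS" ;;
    COMPILE) echo "$COMPILE_DEPS" ;;
    TEST)    echo "$TEST_DEPS" ;;
    *)       echo "Unknown TOOLCHAIN_MODE: $1" >&2; return 1 ;;
  esac
}

for dep in $(resolve_deps "${TOOLCHAIN_MODE:-ALL}"); do
  echo "would fetch: $dep"   # placeholder for the real download step
done
```

Developers who only compile the backend would set TOOLCHAIN_MODE=COMPILE and skip pulling the (much larger) test-environment tarballs.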


Re: Getting rid of thirdparty

Posted by Matthew Jacobs <mj...@cloudera.com>.
Thanks for outlining these options. How does native-toolchain factor into
our ASF story? I.e. do we need it to be less Cloudera-project-oriented, or
is it OK for it to contain CDH (rather than Apache Hadoop) deployments? If
we're considering it to be more Cloudera-focused, it seems like it could
make upstream contributions difficult as there wouldn't really be a non-CDH
build/runtime toolchain. I guess upstream contributors could fork our
toolchain (or start their own) and replace the CDH components? If we detach
the compile-time dependencies and the test runtime projects, it would
probably make things easier for the rest of the world as they could easily
take the native-toolchain, the test environment, or both.
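One way to get that separation without forking: have the upstream config default to ASF release tarballs and let a vendor override the mirror and component versions from the environment. A rough impala-config.sh-style sketch (the variable names, the version default, and the mirror default are all assumptions, not actual Impala config):

```shell
# Hypothetical sketch: default to ASF-released artifacts, but allow a
# downstream vendor to point at its own mirror/versions by exporting
# these variables before this file is sourced.
: "${TOOLCHAIN_MIRROR:=https://archive.apache.org/dist}"
: "${HADOOP_VERSION:=2.6.0}"

# Build the download URL for the Hadoop tarball from the current settings.
hadoop_url() {
  echo "$TOOLCHAIN_MIRROR/hadoop/common/hadoop-$HADOOP_VERSION/hadoop-$HADOOP_VERSION.tar.gz"
}
```

A vendor build would export TOOLCHAIN_MIRROR and the version variables to substitute its own releases, while the upstream defaults stay purely ASF.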
