Posted to user@hive.apache.org by Aaron Kimball <aa...@cloudera.com> on 2010/03/29 21:02:54 UTC

Sqoop is moving to github!

Hi Hadoop, Hive, and Sqoop users,

For the past year, the Apache Hadoop MapReduce project has played host to
Sqoop, a command-line tool that performs parallel imports and exports
between relational databases and HDFS. We've developed a lot of features and
gotten a lot of great feedback from users. While Sqoop was a contrib project
in Hadoop, it has steadily improved and grown.

But the contrib directory is a home for new or small projects incubating
underneath Hadoop's umbrella. Sqoop is starting to look less like a small
project these days. In particular, a feature that has been growing in
importance for Sqoop is its ability to integrate with Hive. In order to
facilitate this integration from a compilation and testing standpoint, we've
pulled Sqoop out of contrib and into its own repository hosted on github.

You can download all the relevant bits here:
http://www.github.com/cloudera/sqoop

The code there will run in conjunction with the Apache Hadoop trunk source.
(Compatibility with other distributions/versions is forthcoming.)

While we've changed hosts, Sqoop will keep the same license -- future
improvements will remain Apache 2.0-licensed. We welcome the
contributions of all in the open source community; there's a lot of exciting
work still to be done! If you'd like to help out but aren't sure where to
start, send me an email and I can recommend a few areas where improvements
would be appreciated.

Want some more information about Sqoop? An introduction is available here:
http://www.cloudera.com/sqoop
A ready-to-run release of Sqoop is included with Cloudera's Distribution for
Hadoop: http://archive.cloudera.com
And its reference manual is available for browsing at
http://archive.cloudera.com/docs/sqoop

If you have any questions about this move process, please ask me.

Regards,
- Aaron Kimball
Cloudera, Inc.

Re: Sqoop is moving to github!

Posted by Owen O'Malley <ow...@gmail.com>.

On Tue, Mar 30, 2010 at 7:54 AM, Bernd Fondermann <
bernd.fondermann@googlemail.com> wrote:

> Where has this been discussed? Where's the PMC-level vote/record of
> consensus to ratify to abandon the code?
>

The (lack of) discussion is in the jira and this thread. In Hadoop, we treat
adding and removing contrib projects as simple code additions or removals
that use lazy consensus among the committers. If someone wants to block it,
the previously mentioned MAPREDUCE-1644 is the proper avenue. So far, no one
in the project has vetoed the removal.

-- Owen

Re: Sqoop is moving to github!

Posted by Bernd Fondermann <be...@googlemail.com>.

On Tue, Mar 30, 2010 at 19:58, Aaron Kimball <aa...@cloudera.com> wrote:
> On Tue, Mar 30, 2010 at 7:54 AM, Bernd Fondermann <
> bernd.fondermann@googlemail.com> wrote:
>
>> On Tue, Mar 30, 2010 at 15:48, Owen O'Malley <ow...@gmail.com>
>> wrote:
>> > On Tue, Mar 30, 2010 at 1:55 AM, Bernd Fondermann <
>> > bernd.fondermann@googlemail.com> wrote:
>> >
>> >>
>> >> @Hadoop PMC: What is your statement on this fork announcement?
>> >
>> >
>> > Aaron has been the primary developer of Sqoop. He has asked for it to be
>> > removed from MapReduce's contrib. If a committer -1's the code removal, it
>> > will cause a fork. But that hasn't happened and I don't think it will.
>>
>> It's not the contributor's code alone anymore, once it's been committed.
>> Where has this been discussed? Where's the PMC-level vote/record of
>> consensus to ratify to abandon the code?
>> (ML + TS will be sufficient and I'll find my way.)
>>
>>
> I believe that technically the discussion (such as it is) on MAPREDUCE-1644
> will stand as the record of vote. It's still open for new comments.

No, I don't think so. This isn't the record of anything other than your
intentions.
But that shouldn't be your concern.

  Bernd

Re: Sqoop is moving to github!

Posted by Aaron Kimball <aa...@cloudera.com>.

On Tue, Mar 30, 2010 at 7:54 AM, Bernd Fondermann <
bernd.fondermann@googlemail.com> wrote:

> On Tue, Mar 30, 2010 at 15:48, Owen O'Malley <ow...@gmail.com>
> wrote:
> > On Tue, Mar 30, 2010 at 1:55 AM, Bernd Fondermann <
> > bernd.fondermann@googlemail.com> wrote:
> >
> >>
> >> @Hadoop PMC: What is your statement on this fork announcement?
> >
> >
> > Aaron has been the primary developer of Sqoop. He has asked for it to be
> > removed from MapReduce's contrib. If a committer -1's the code removal, it
> > will cause a fork. But that hasn't happened and I don't think it will.
>
> It's not the contributor's code alone anymore, once it's been committed.
> Where has this been discussed? Where's the PMC-level vote/record of
> consensus to ratify to abandon the code?
> (ML + TS will be sufficient and I'll find my way.)
>
>
I believe that technically the discussion (such as it is) on MAPREDUCE-1644
will stand as the record of vote. It's still open for new comments.


> > In order for it to make sense to keep the code in Hadoop, someone needs to
> > be working on it and the only person working on it is Aaron. If someone
> > wants to make a case for keeping Sqoop in MapReduce, please make it in the
> > jira:
> > https://issues.apache.org/jira/browse/MAPREDUCE-1644
>
> Well, at least someone reviewed and committed all the contributions to
> svn in the first place, he can't be the only one working on it.
>
>
Indeed. Hadoop committers have checked in all my contributions; Tom White
has done the lion's share of this work.



> > The sub-projects must avoid circular dependencies. We must be able to build
> > the sub-projects in order without having to go back and update a previous
> > project. Making a component of MapReduce depend on Hive (or Pig) would cause
> > that kind of cycle. Some projects like Zebra just live in the upper project
> > (Pig in that case). The other option is to ask to be a sub-project, ask to
> > join Apache incubator, or go to Github. Github doesn't give him the legal or
> > community aspects of Apache, but it also doesn't require asking permission
> > or introduce any process requirements.
>
> Well, I'm not looking at this from the contributor's perspective here.
> He can do with the code whatever he likes, basically.
> It's only the PMC's viewpoint which is interesting for me.
>
>  Bernd
>

- Aaron

Re: Sqoop is moving to github!

Posted by Bernd Fondermann <be...@googlemail.com>.

On Tue, Mar 30, 2010 at 15:48, Owen O'Malley <ow...@gmail.com> wrote:
> On Tue, Mar 30, 2010 at 1:55 AM, Bernd Fondermann <
> bernd.fondermann@googlemail.com> wrote:
>
>>
>> @Hadoop PMC: What is your statement on this fork announcement?
>
>
> Aaron has been the primary developer of Sqoop. He has asked for it to be
> removed from MapReduce's contrib. If a committer -1's the code removal, it
> will cause a fork. But that hasn't happened and I don't think it will.

It's not the contributor's code alone anymore, once it's been committed.
Where has this been discussed? Where's the PMC-level vote/record of
consensus to ratify to abandon the code?
(ML + TS will be sufficient and I'll find my way.)

> In
> order for it to make sense to keep the code in Hadoop, someone needs to be
> working on it and the only person working on it is Aaron. If someone wants
> to make a case for keeping Sqoop in MapReduce, please make it in the jira:
> https://issues.apache.org/jira/browse/MAPREDUCE-1644

Well, at least someone reviewed and committed all the contributions to
svn in the first place, he can't be the only one working on it.

> The sub-projects must avoid circular dependencies. We must be able to build
> the sub-projects in order without having to go back and update a previous
> project. Making a component of MapReduce depend on Hive (or Pig) would cause
> that kind of cycle. Some projects like Zebra just live in the upper project
> (Pig in that case). The other option is to ask to be a sub-project, ask to
> join Apache incubator, or go to Github. Github doesn't give him the legal or
> community aspects of Apache, but it also doesn't require asking permission
> or introduce any process requirements.

Well, I'm not looking at this from the contributor's perspective here.
He can do with the code whatever he likes, basically.
It's only the PMC's viewpoint which is interesting for me.

  Bernd

Re: Sqoop is moving to github!

Posted by Owen O'Malley <ow...@gmail.com>.

On Tue, Mar 30, 2010 at 1:55 AM, Bernd Fondermann <
bernd.fondermann@googlemail.com> wrote:

>
> @Hadoop PMC: What is your statement on this fork announcement?

Aaron has been the primary developer of Sqoop. He has asked for it to be
removed from MapReduce's contrib. If a committer -1's the code removal, it
will cause a fork. But that hasn't happened and I don't think it will. In
order for it to make sense to keep the code in Hadoop, someone needs to be
working on it and the only person working on it is Aaron. If someone wants
to make a case for keeping Sqoop in MapReduce, please make it in the jira:
https://issues.apache.org/jira/browse/MAPREDUCE-1644

The sub-projects must avoid circular dependencies. We must be able to build
the sub-projects in order without having to go back and update a previous
project. Making a component of MapReduce depend on Hive (or Pig) would cause
that kind of cycle. Some projects like Zebra just live in the upper project
(Pig in that case). The other option is to ask to be a sub-project, ask to
join Apache incubator, or go to Github. Github doesn't give him the legal or
community aspects of Apache, but it also doesn't require asking permission
or introduce any process requirements.

-- Owen

Re: Sqoop is moving to github!

Posted by Aaron Kimball <aa...@cloudera.com>.

Hi Bernd,

These are some important questions about the status of the project; thanks
for asking them. I'll go through them inline to preserve context on each
one.

On Tue, Mar 30, 2010 at 1:55 AM, Bernd Fondermann <
bernd.fondermann@googlemail.com> wrote:

> Hi Aaron,
>
> Good to see you are a contributor to sqoop. Are you a committer yet?
> Do you have an ICLA on file with the ASF? I cannot find any record of it.
> I must be missing something here, since PMCs are normally requesting
> ICLAs from people making such substantial code contributions.
>
>
It might be more precise to call me *the* contributor to Sqoop. I've written
about 98% of the code for it; a few other individuals have provided me with
small enhancements or bugfixes, but the overwhelming amount of its care and
feeding has been under my watch.

I am not a committer on the Hadoop MapReduce (or any other ASF) project.
Thus far, nobody has invited me to sign an ICLA with my contributor-only
status. I have relied on others (primarily Tom White) to actually commit all
the Sqoop patches to svn.

> On Mon, Mar 29, 2010 at 21:02, Aaron Kimball <aa...@cloudera.com> wrote:
> > Hi Hadoop, Hive, and Sqoop users,
> >
> > For the past year, the Apache Hadoop MapReduce project has played host to
> > Sqoop, a command-line tool that performs parallel imports and exports
> > between relational databases and HDFS. We've developed a lot of features and
> > gotten a lot of great feedback from users.
>
> Who is "we" exactly? Cloudera? Hadoop? You?
>
>
Both myself and Cloudera. As said above, the vast majority of the direct
work on the project has been my own. But there are others at Cloudera who
have helped in less visible fashion with feature prioritization, design
input, code review, QA, user support, etc. And I make my contributions to
Sqoop as an employee of Cloudera.

> > While Sqoop was a contrib project
> > in Hadoop, it has been steadily improved and grown.
>
> Cool.
>
> > But the contrib directory is a home for new or small projects incubating
> > underneath Hadoop's umbrella. Sqoop is starting to look less like a small
> > project these days. In particular, a feature that has been growing in
> > importance for Sqoop is its ability to integrate with Hive. In order to
> > facilitate this integration from a compilation and testing standpoint, we've
> > pulled Sqoop out of contrib and into its own repository hosted on github.
>
> So, you are forking sqoop. To make it possible for a Hadoop project to
> work with another Hadoop project.
> What are the issues with Hadoop that prevent you from doing it within Hadoop itself?
>
>
When you put it like that, "forking" seems like a bit of a strong term. As
said in my original email, I prefer to think of it as "moving." (See the
next answer below for more on this.)

I believe Owen has already described some of the technical problems.
Conflating Sqoop's source repository with Hadoop's causes unnecessary
circular dependencies that build tools cannot easily work around. The more
straightforward method is to factor out Sqoop into a separate source
repository.

> > You can download all the relevant bits here:
> > http://www.github.com/cloudera/sqoop
> >
> > The code there will run in conjunction with the Apache Hadoop trunk source.
> > (Compatibility with other distributions/versions is forthcoming.)
>
> Sqoop is in ASF svn. What do you do when someone continues
> developing it here?
> Then there's a naming clash. Do you intend to rename your fork?
>

I have filed MAPREDUCE-1644 with a patch that completely removes Sqoop from
the MapReduce repository. This will remove Sqoop from the working copy of
the repository, but of course, it will still belong to the ASF's repository
history. (Thus, I hope this will be seen as a straightforward lateral move
more than a fork.)

It's worth pointing out that Sqoop was originally introduced in HADOOP-5815,
committed after 0.20 was branched for release and closed to new features. So
Sqoop has only existed on unreleased development branches in ASF svn. As
such, removing new features from the working copy is still allowed. As
Hadoop is gearing up for a new release, now is the time to consider whether
side-projects like this belong in the same umbrella project.

Others can -1 the removal patch and force a copy of Sqoop to remain in ASF.
This will force Sqoop to be bundled with the impending Hadoop 0.21 release.
However, I do not intend to rename Sqoop. I also intend to do all feature
and bugfix development in the new repository on github. I will be monitoring
the issue tracker on github for bug reports and feature requests. For
someone else to seriously -1 MAPREDUCE-1644, they'd need to be willing to
fix bugs in the ASF copy themselves, or cross-port the patches I develop at
github and graft them on to the ASF copy.

If others are interested in contributing to Sqoop and would like to take on
a role in the project, I welcome them to come help me out at github, rather
than force a true fork to occur and work within MapReduce svn. If enough
people want to work on Sqoop but feel strongly that we should remain in the
ASF (e.g., by introducing a new project in the incubator), I'm certainly
open to listening to that point of view. But that's a separate discussion
from this one.

>
> > While we've changed hosts, Sqoop will keep the same license -- future
> > improvements will continue to remain Apache 2.0-licensed. We welcome the
> > contributions of all in the open source community; there's a lot of exciting
> > work still to be done! If you'd like to help out but aren't sure where to
> > start, send me an email and I can recommend a few areas where improvements
> > would be appreciated.
>
> Who is "we" in this case? The same "we" as above?
>
>
Indeed.

> @Hadoop PMC: What is your statement on this fork announcement?
>
> Thanks for clarifying,
>
>  Bernd
>

Regards,
- Aaron Kimball

Re: Sqoop is moving to github!

Posted by Bernd Fondermann <be...@googlemail.com>.

Hi Aaron,

Good to see you are a contributor to sqoop. Are you a committer yet?
Do you have an ICLA on file with the ASF? I cannot find any record of it.
I must be missing something here, since PMCs are normally requesting
ICLAs from people making such substantial code contributions.

On Mon, Mar 29, 2010 at 21:02, Aaron Kimball <aa...@cloudera.com> wrote:
> Hi Hadoop, Hive, and Sqoop users,
>
> For the past year, the Apache Hadoop MapReduce project has played host to
> Sqoop, a command-line tool that performs parallel imports and exports
> between relational databases and HDFS. We've developed a lot of features and
> gotten a lot of great feedback from users.

Who is "we" exactly? Cloudera? Hadoop? You?

> While Sqoop was a contrib project
> in Hadoop, it has been steadily improved and grown.

Cool.

> But the contrib directory is a home for new or small projects incubating
> underneath Hadoop's umbrella. Sqoop is starting to look less like a small
> project these days. In particular, a feature that has been growing in
> importance for Sqoop is its ability to integrate with Hive. In order to
> facilitate this integration from a compilation and testing standpoint, we've
> pulled Sqoop out of contrib and into its own repository hosted on github.

So, you are forking sqoop. To make it possible for a Hadoop project to
work with another Hadoop project.
What are the issues with Hadoop that prevent you from doing it within Hadoop itself?

> You can download all the relevant bits here:
> http://www.github.com/cloudera/sqoop
>
> The code there will run in conjunction with the Apache Hadoop trunk source.
> (Compatibility with other distributions/versions is forthcoming.)

Sqoop is in ASF svn. What do you do when someone continues
developing it here?
Then there's a naming clash. Do you intend to rename your fork?

> While we've changed hosts, Sqoop will keep the same license -- future
> improvements will continue to remain Apache 2.0-licensed. We welcome the
> contributions of all in the open source community; there's a lot of exciting
> work still to be done! If you'd like to help out but aren't sure where to
> start, send me an email and I can recommend a few areas where improvements
> would be appreciated.

Who is "we" in this case? The same "we" as above?

@Hadoop PMC: What is your statement on this fork announcement?

Thanks for clarifying,

  Bernd

Re: Sqoop is moving to github!

Posted by Aaron Kimball <aa...@cloudera.com>.

Hi Raghu,

github has issue tracking support built-in. I've enabled issue tracking for
the project, so you should be able to create reports there.

- Aaron

On Mon, Mar 29, 2010 at 12:06 PM, Raghu Murthy <rm...@facebook.com> wrote:

> Hi Aaron,
>
> Where will you track bugs/features/improvements on sqoop?
>
> Thanks,
> raghu
>
> On 3/29/10 12:02 PM, "Aaron Kimball" <aa...@cloudera.com> wrote:
>
> > Hi Hadoop, Hive, and Sqoop users,
> >
> > For the past year, the Apache Hadoop MapReduce project has played host to
> > Sqoop, a command-line tool that performs parallel imports and exports between
> > relational databases and HDFS. We've developed a lot of features and gotten a
> > lot of great feedback from users. While Sqoop was a contrib project in Hadoop,
> > it has been steadily improved and grown.
> >
> > But the contrib directory is a home for new or small projects incubating
> > underneath Hadoop's umbrella. Sqoop is starting to look less like a small
> > project these days. In particular, a feature that has been growing in
> > importance for Sqoop is its ability to integrate with Hive. In order to
> > facilitate this integration from a compilation and testing standpoint, we've
> > pulled Sqoop out of contrib and into its own repository hosted on github.
> >
> > You can download all the relevant bits here:
> > http://www.github.com/cloudera/sqoop
> >
> > The code there will run in conjunction with the Apache Hadoop trunk source.
> > (Compatibility with other distributions/versions is forthcoming.)
> >
> > While we've changed hosts, Sqoop will keep the same license -- future
> > improvements will continue to remain Apache 2.0-licensed. We welcome the
> > contributions of all in the open source community; there's a lot of exciting
> > work still to be done! If you'd like to help out but aren't sure where to
> > start, send me an email and I can recommend a few areas where improvements
> > would be appreciated.
> >
> > Want some more information about Sqoop? An introduction is available here:
> > http://www.cloudera.com/sqoop
> > A ready-to-run release of Sqoop is included with Cloudera's Distribution for
> > Hadoop: http://archive.cloudera.com
> > And its reference manual is available for browsing at
> > http://archive.cloudera.com/docs/sqoop
> >
> > If you have any questions about this move process, please ask me.
> >
> > Regards,
> > - Aaron Kimball
> > Cloudera, Inc.
> >
>
>

Re: Sqoop is moving to github!

Posted by Raghu Murthy <rm...@facebook.com>.

Hi Aaron,

Where will you track bugs/features/improvements on sqoop?

Thanks,
raghu

On 3/29/10 12:02 PM, "Aaron Kimball" <aa...@cloudera.com> wrote:

> Hi Hadoop, Hive, and Sqoop users,
> 
> For the past year, the Apache Hadoop MapReduce project has played host to
> Sqoop, a command-line tool that performs parallel imports and exports between
> relational databases and HDFS. We've developed a lot of features and gotten a
> lot of great feedback from users. While Sqoop was a contrib project in Hadoop,
> it has been steadily improved and grown.
> 
> But the contrib directory is a home for new or small projects incubating
> underneath Hadoop's umbrella. Sqoop is starting to look less like a small
> project these days. In particular, a feature that has been growing in
> importance for Sqoop is its ability to integrate with Hive. In order to
> facilitate this integration from a compilation and testing standpoint, we've
> pulled Sqoop out of contrib and into its own repository hosted on github.
> 
> You can download all the relevant bits here:
> http://www.github.com/cloudera/sqoop
> 
> The code there will run in conjunction with the Apache Hadoop trunk source.
> (Compatibility with other distributions/versions is forthcoming.)
> 
> While we've changed hosts, Sqoop will keep the same license -- future
> improvements will continue to remain Apache 2.0-licensed. We welcome the
> contributions of all in the open source community; there's a lot of exciting
> work still to be done! If you'd like to help out but aren't sure where to
> start, send me an email and I can recommend a few areas where improvements
> would be appreciated.
> 
> Want some more information about Sqoop? An introduction is available here:
> http://www.cloudera.com/sqoop
> A ready-to-run release of Sqoop is included with Cloudera's Distribution for
> Hadoop: http://archive.cloudera.com
> And its reference manual is available for browsing at
> http://archive.cloudera.com/docs/sqoop
> 
> If you have any questions about this move process, please ask me.
> 
> Regards,
> - Aaron Kimball
> Cloudera, Inc.
>