Posted to dev@sqoop.apache.org by Atul Gupta <at...@expedia.com> on 2014/10/03 20:03:44 UTC

Sqoop

Hi,

Let me give you some background on the project first. Currently we use an in-house tool for moving data from RDBMS to HDFS, and based on our business requirements we have added many new features to that tool. The current in-house solution is not scalable and also has maintainability issues, so two months ago we decided to move to Sqoop. When we did a feature gap analysis between the in-house tool and Sqoop, we found that most of the in-house features are missing in Sqoop, so we decided to build customizations around Sqoop. We also plan to contribute them back to the open source community. The following is the list of features:

  1.  Sqoop currently doesn't support dynamic partitions. We are planning to support them. Basically, the user can specify the partition column, and Sqoop will figure out the partition details and create the partitions if they don't exist.
  2.  Sqoop only supports partitioning on String fields. We are planning to extend this to other data types such as Integer, Date, etc.
  3.  Sqoop doesn't support data merge for Hive tables, especially if they are partitioned. We are planning to support merge for Hive tables.
  4.  Sqoop doesn't restrict the maximum load for a given mapper, so a mapper sometimes becomes overloaded and causes performance issues. We are planning to add volume-per-mapper control to Sqoop.
  5.  Sqoop doesn't support external tables for Hive. We are planning to add this feature as well.
  6.  Merge can be done only on one key. We will enhance it to support multi-field keys for merge.
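
To make the last item concrete, here is a minimal Python sketch of what merging on a multi-field key could mean. It is an illustration only and does not use any Sqoop API: the record layout, the column names, and the merge_on_keys helper are all hypothetical, and it assumes the newer dataset simply wins whenever both datasets contain the same composite key.

```python
from typing import Dict, Iterable, List, Sequence, Tuple

Record = Dict[str, object]


def merge_on_keys(
    base: Iterable[Record],
    updates: Iterable[Record],
    key_fields: Sequence[str],
) -> List[Record]:
    """Merge two record sets, letting `updates` win on a composite key.

    Hypothetical helper for illustration; not part of Sqoop.
    """
    merged: Dict[Tuple[object, ...], Record] = {}
    for record in base:
        merged[tuple(record[f] for f in key_fields)] = record
    for record in updates:
        # A newer record with the same composite key replaces the old one;
        # a record with an unseen key is appended.
        merged[tuple(record[f] for f in key_fields)] = record
    return list(merged.values())


if __name__ == "__main__":
    old = [
        {"order_id": 1, "line_no": 1, "qty": 2},
        {"order_id": 1, "line_no": 2, "qty": 5},
    ]
    new = [
        {"order_id": 1, "line_no": 2, "qty": 7},
        {"order_id": 2, "line_no": 1, "qty": 1},
    ]
    for row in merge_on_keys(old, new, ["order_id", "line_no"]):
        print(row)
```

With a single-column key, the two rows of order 1 above would collide; keying on (order_id, line_no) keeps them distinct while still replacing the updated line.
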
These are at a high level, and there are a few others as well. The team is ready to work with the Sqoop dev community and is aware of the process, but we have the following open questions, and the answers would really help us make the final call.


1.       Can a developer create a branch of Sqoop and start development directly?

2.       Who decides the timelines for feature delivery?

3.       What is the expected release date of Sqoop 1.4.6?

4.       Who decides the feature priorities?

5.       If feature priorities are decided by a product owner, can we negotiate with the PM on feature priorities?

6.       Once the development work is completed, who will do the code review?

7.       Who will create the documentation?

In case you need more clarity, we are ready to set up a WebEx/Skype call with you.

Thanks,
Atul Gupta
Engineering Manager
Expedia Inc

Re: Sqoop

Posted by Rakesh Sharma <ra...@expedia.com>.
++Atul

From: Venkat Ranganathan <vr...@hortonworks.com>
Date: Friday, October 3, 2014 at 11:55 PM
To: "dev@sqoop.apache.org" <de...@sqoop.apache.org>
Cc: Rakesh Sharma <ra...@expedia.com>, Shashank Tandon <st...@expedia.com>
Subject: Re: Sqoop

Atul Gupta

Please see below


>> 1.  Sqoop currently doesn't support dynamic partitions. We are planning to support them. Basically, the user can specify the partition column, and Sqoop will figure out the partition details and create the partitions if they don't exist.

Dynamic partitioning has been part of Sqoop for a while, as part of the HCatalog support.

>> 2.  Sqoop only supports partitioning on String fields. We are planning to extend this to other data types such as Integer, Date, etc.

Even with the HCatalog integration (and the enhancements we made to it to support all Hive types), this is an outstanding issue. It is also being fixed in HCatalog.

>> 5.  Sqoop doesn't support external tables for Hive. We are planning to add this feature as well.

This is also addressed by the HCatalog integration.

Venkat



CONFIDENTIALITY NOTICE
NOTICE: This message is intended for the use of the individual or entity to which it is addressed and may contain information that is confidential, privileged and exempt from disclosure under applicable law. If the reader of this message is not the intended recipient, you are hereby notified that any printing, copying, dissemination, distribution, disclosure or forwarding of this communication is strictly prohibited. If you have received this communication in error, please contact the sender immediately and delete it from your system. Thank You.

Re: Sqoop

Posted by Venkat Ranganathan <vr...@hortonworks.com>.
Atul Gupta

Please see below


>> 1.  Sqoop currently doesn't support dynamic partitions. We are planning
to support them. Basically, the user can specify the partition column, and
Sqoop will figure out the partition details and create the partitions if
they don't exist.

Dynamic partitioning has been part of Sqoop for a while, as part of the
HCatalog support.

>> 2.  Sqoop only supports partitioning on String fields. We are planning
to extend this to other data types such as Integer, Date, etc.

Even with the HCatalog integration (and the enhancements we made to it to
support all Hive types), this is an outstanding issue. It is also being
fixed in HCatalog.

>> 5.  Sqoop doesn't support external tables for Hive. We are planning to
add this feature as well.

This is also addressed by the HCatalog integration.

Venkat
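
For readers unfamiliar with the HCatalog route mentioned in this reply, a rough sketch of an import that targets an HCatalog table is shown below, written in Python as a small command builder. The JDBC URL, database, and table names are placeholders, and build_hcatalog_import is a hypothetical helper. The options themselves (--connect, --table, --hcatalog-database, --hcatalog-table, --num-mappers) are standard Sqoop 1.4.x import options, but how partition columns are filled in depends on the target table definition and the Sqoop version, so check the Sqoop user guide before relying on this.

```python
import shlex
from typing import List


def build_hcatalog_import(
    jdbc_url: str,
    source_table: str,
    hcat_database: str,
    hcat_table: str,
    num_mappers: int = 4,
) -> List[str]:
    """Build an argument list for a Sqoop HCatalog import.

    Hypothetical helper; it only assembles the command and runs nothing.
    """
    return [
        "sqoop", "import",
        "--connect", jdbc_url,
        "--table", source_table,
        # Target an HCatalog-managed table instead of a plain HDFS directory.
        "--hcatalog-database", hcat_database,
        "--hcatalog-table", hcat_table,
        "--num-mappers", str(num_mappers),
    ]


if __name__ == "__main__":
    # All values below are placeholders chosen for illustration.
    args = build_hcatalog_import(
        "jdbc:mysql://db.example.com/sales",
        "orders",
        "default",
        "orders_partitioned",
    )
    print(" ".join(shlex.quote(a) for a in args))
```

Keeping the command as a list rather than one string avoids shell-quoting mistakes if it is later passed to subprocess.run.
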

