Posted to user@sqoop.apache.org by Jack Arenas <j...@ckarenas.com> on 2015/03/02 21:24:51 UTC

Submitting Sqoop jobs in parallel

Hi team,

I'm building an ETL tool that requires me to pull a bunch of tables from a database into HDFS, and I'm currently doing this sequentially with Sqoop. I figured it might be faster to submit the Sqoop jobs in parallel, that is, with a predefined thread pool (currently trying 8), because it took about two hours to ingest 150 tables of various sizes; frankly, not very big tables, as this is a POC. Sequentially this works fine, but as soon as I add parallelism, roughly 75% of my Sqoop jobs fail. They do ingest data, but it gets stuck in the staging area (i.e. /user/username) instead of the proper Hive table location (i.e. /user/username/Hive/Lab). Has anyone experienced this before? I figure I may be able to run a separate process that moves the tables from the staging area into the Hive warehouse area, but I'm not sure whether that process would simply move the files or whether there is more involved.
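For context, the submission side of the tool looks roughly like the sketch below. This is a minimal Python sketch, not the exact code: the connection string, credentials, Hive database, and table names are placeholders, and the real per-table flags vary.

import subprocess
from concurrent.futures import ThreadPoolExecutor

JDBC_URL = "jdbc:mysql://dbhost/mydb"       # placeholder connection string
TABLES = ["customers", "orders", "items"]   # placeholder table names

def run_import(table):
    # One sqoop import per table; --hive-import is the step that should
    # move the data out of the HDFS staging dir and register it with Hive.
    cmd = [
        "sqoop", "import",
        "--connect", JDBC_URL,
        "--username", "etl_user",           # placeholder credentials
        "--password", "changeme",
        "--table", table,
        "--hive-import",
        "--hive-table", "lab." + table,     # placeholder Hive database
    ]
    return table, subprocess.call(cmd)

# Predefined thread pool of 8, as described above.
with ThreadPoolExecutor(max_workers=8) as pool:
    for table, rc in pool.map(run_import, TABLES):
        print(table, "exited with", rc)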

Thanks!

Specs: HDP 2.1, Sqoop 1.4.4.2

Cheers,
Jack


Re: Submitting Sqoop jobs in parallel

Posted by Jack Arenas <j...@ckarenas.com>.
Solved by switching the Hive metastore from the embedded Derby database to PostgreSQL. The embedded Derby metastore only allows one process to connect at a time, so the parallel Sqoop jobs were colliding on the Hive registration step; that's why the data kept landing in staging but never made it into the warehouse. With PostgreSQL backing the metastore, the concurrent imports complete cleanly.
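For anyone who finds this thread later: the change is made in hive-site.xml, roughly as in the excerpt below. Host, port, database name, and credentials are placeholders for your own setup, and you also need the PostgreSQL JDBC driver on Hive's classpath plus a metastore database and user created in PostgreSQL.

<property>
  <name>javax.jdo.option.ConnectionURL</name>
  <value>jdbc:postgresql://metastore-host:5432/hive</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionDriverName</name>
  <value>org.postgresql.Driver</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionUserName</name>
  <value>hive</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionPassword</name>
  <value>changeme</value>
</property>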

On Fri, Mar 6, 2015 at 8:16 AM, Jack Arenas <j...@ckarenas.com> wrote:

> Abe et al,
>
> How do you mean? Isn't that the point of the --hive-table flag, to put
> each table in the proper <schema>.db folder under <path>/Hive/Lab for each
> Sqoop job? I'm not sure what you mean... I tried setting --target-dir to
> <path>/Hive/Lab/<schema>.db/<table>, and yes, it ingests the data into
> that folder in HDFS, but Hive doesn't recognize that the tables are there.
> It's as if the step that actually registers the data with Hive breaks when
> parallelized.
>
> Hope this info helps.
>
> Best,
> Jack
>
> On Mar 3, 2015, at 8:46 PM, Abraham Elmahrek <ab...@cloudera.com> wrote:
>
> Jack,
>
> Just a thought... but have you tried using --target-dir?
>
> -Abe
>
> On Mon, Mar 2, 2015 at 12:24 PM, Jack Arenas <j...@ckarenas.com> wrote:
>
>> Hi team,
>>
>> I'm building an ETL tool that requires me to pull a bunch of tables from
>> a database into HDFS, and I'm currently doing this sequentially with
>> Sqoop. I figured it might be faster to submit the Sqoop jobs in parallel,
>> that is, with a predefined thread pool (currently trying 8), because it
>> took about two hours to ingest 150 tables of various sizes; frankly, not
>> very big tables, as this is a POC. Sequentially this works fine, but as
>> soon as I add parallelism, roughly 75% of my Sqoop jobs fail. They do
>> ingest data, but it gets stuck in the staging area (i.e. /user/username)
>> instead of the proper Hive table location (i.e. /user/username/Hive/Lab).
>> Has anyone experienced this before? I figure I may be able to run a
>> separate process that moves the tables from the staging area into the
>> Hive warehouse area, but I'm not sure whether that process would simply
>> move the files or whether there is more involved.
>>
>> Thanks!
>>
>> Specs: HDP 2.1, Sqoop 1.4.4.2
>>
>> Cheers,
>> Jack
>>
>>
>


-- 
Jack Arenas
Data Engineer & Web Developer
j@ckarenas.com
+1.805.259.8059
<http://www.linkedin.com/in/jackarenas>

Re: Submitting Sqoop jobs in parallel

Posted by Jack Arenas <j...@ckarenas.com>.
Abe et al,

How do you mean? Isn't that the point of the --hive-table flag, to put each table in the proper <schema>.db folder under <path>/Hive/Lab for each Sqoop job? I'm not sure what you mean... I tried setting --target-dir to <path>/Hive/Lab/<schema>.db/<table>, and yes, it ingests the data into that folder in HDFS, but Hive doesn't recognize that the tables are there. It's as if the step that actually registers the data with Hive breaks when parallelized.
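If it helps, the manual workaround I've been sketching for the stuck imports is roughly the following; the table and path names are hypothetical, and it assumes the Hive table itself was already created:

import subprocess

# Hypothetical names; in practice these come from the failed job's config.
hive_table = "lab.customers"
staging_dir = "/user/username/customers"

# LOAD DATA INPATH moves the staged files into the table's warehouse
# location and registers them with Hive; this is the final step that
# --hive-import normally performs.
subprocess.call(["hive", "-e",
                 "LOAD DATA INPATH '{0}' INTO TABLE {1}".format(staging_dir, hive_table)])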

Hope this info helps.

Best,
Jack

> On Mar 3, 2015, at 8:46 PM, Abraham Elmahrek <ab...@cloudera.com> wrote:
> 
> Jack,
> 
> Just a thought... but have you tried using --target-dir?
> 
> -Abe
> 
>> On Mon, Mar 2, 2015 at 12:24 PM, Jack Arenas <j...@ckarenas.com> wrote:
>> Hi team,
>> 
>> I'm building an ETL tool that requires me to pull a bunch of tables from a database into HDFS, and I'm currently doing this sequentially with Sqoop. I figured it might be faster to submit the Sqoop jobs in parallel, that is, with a predefined thread pool (currently trying 8), because it took about two hours to ingest 150 tables of various sizes; frankly, not very big tables, as this is a POC. Sequentially this works fine, but as soon as I add parallelism, roughly 75% of my Sqoop jobs fail. They do ingest data, but it gets stuck in the staging area (i.e. /user/username) instead of the proper Hive table location (i.e. /user/username/Hive/Lab). Has anyone experienced this before? I figure I may be able to run a separate process that moves the tables from the staging area into the Hive warehouse area, but I'm not sure whether that process would simply move the files or whether there is more involved.
>> 
>> Thanks!
>> 
>> Specs: HDP 2.1, Sqoop 1.4.4.2
>> 
>> Cheers,
>> Jack
> 

Re: Submitting Sqoop jobs in parallel

Posted by Abraham Elmahrek <ab...@cloudera.com>.
Jack,

Just a thought... but have you tried using --target-dir?

-Abe

On Mon, Mar 2, 2015 at 12:24 PM, Jack Arenas <j...@ckarenas.com> wrote:

> Hi team,
>
> I'm building an ETL tool that requires me to pull a bunch of tables from
> a database into HDFS, and I'm currently doing this sequentially with
> Sqoop. I figured it might be faster to submit the Sqoop jobs in parallel,
> that is, with a predefined thread pool (currently trying 8), because it
> took about two hours to ingest 150 tables of various sizes; frankly, not
> very big tables, as this is a POC. Sequentially this works fine, but as
> soon as I add parallelism, roughly 75% of my Sqoop jobs fail. They do
> ingest data, but it gets stuck in the staging area (i.e. /user/username)
> instead of the proper Hive table location (i.e. /user/username/Hive/Lab).
> Has anyone experienced this before? I figure I may be able to run a
> separate process that moves the tables from the staging area into the
> Hive warehouse area, but I'm not sure whether that process would simply
> move the files or whether there is more involved.
>
> Thanks!
>
> Specs: HDP 2.1, Sqoop 1.4.4.2
>
> Cheers,
> Jack
>
>