Posted to user@sqoop.apache.org by Syed Akram <ak...@zohocorp.com> on 2015/03/12 13:57:02 UTC

Need suggestions to speed up Sqoop imports

Hi,

I am using Sqoop 1.4.5 to import from MySQL into Hive.


I have a MySQL DB cluster with 200 GB of data, containing 200 databases, and each database has at least 600 tables (a mixture of big and small/empty tables).


When I import big tables, the performance is quite good.


But when I try to Sqoop-import small tables (even empty tables with 0 records), each table takes at least 20 seconds.


1. How can I reduce this time for small tables?


My Sqoop import invocation looks like this:


                                             sqoop "import",
 "--connect", uri, 
 "--query", sqlText,
 "--map-column-java", 
 "oprtype=Integer", 
 "--target-dir", targetDir,
 "--hive-import",
 "--hive-table", hiveTable,
 "--username", userName, 
 "--password", password, 
 "--split-by", primaryKey, 
 "--num-mappers","2",
 "--boundary-query",boundaryQry,
 "--hive-overwrite",
 "--class-name",tableName,
 "--outdir", "tmp_sqoop/"+tableName


where "--query" is "select tableName.*, oprtype as 0, modified_time as 0 where $CONDITIONS"


"--split-by" primarykey
"--boundary-query" select min(primarykey), max(primarykey) from table;


This runs fine for big tables, even ones with billions of rows.


But for small tables, I'm seeing a constant overhead on every Sqoop import.


How do I optimize things for small tables or tables with 0 records? I want to reduce the per-table latency.


Please share your suggestions on this.


Cheers!!!!


Re: Need suggestions to speed up Sqoop imports

Posted by Abraham Elmahrek <ab...@cloudera.com>.
Syed,

This depends heavily on your resources. How long are the longest Sqoop
jobs taking? If the longest jobs take 2 days, you'll have to increase
the number of mappers sqooping data from your database. You can tune the
parallelism by adding the "-m" argument. Check out
http://sqoop.apache.org/docs/1.4.5/SqoopUserGuide.html#_controlling_parallelism
for more information.
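
For example, something like the following (an illustrative sketch; the
connect string, table, and split column are placeholders) raises a single
import from the default 4 mappers to 16:

sqoop import \
  --connect jdbc:mysql://dbhost/mydb \
  --username user -P \
  --table big_table \
  --split-by id \
  -m 16

Keep in mind that -m only pays off when --split-by names a column whose
values divide the rows into evenly sized ranges.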

-Abe


Re: Need suggestions to speed up Sqoop imports

Posted by Syed Akram <ak...@zohocorp.com>.
Thanks for your valuable responses, Abe and Martin. This helped me and I gave it a try.

Along with this, I need one more suggestion.


One database with 600 tables takes 3 hours to Hive-import using the approach above.
I have 200 databases like this, and if I run them sequentially, one by one, it takes 25 days to complete one DB cluster.


But this is not the result I'm expecting; I want to import all 200 databases in 2 days at most. How can I make this happen?


That is, one DB cluster with 200 or more databases should be imported as quickly as possible.


Any suggestions on this? I've been working on it for a long time, so suggestions are valuable.
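
For scale: 200 databases x 3 hours is about 600 hours, i.e. 25 days
sequentially, so finishing in 2 days (48 hours) means keeping roughly
600/48 = 13 database imports running at once. A minimal sketch of that
fan-out, assuming a hypothetical dbs.txt listing the database names and a
hypothetical import_db.sh wrapper that imports a single database (the
safe concurrency depends on how much load MySQL and the cluster can take):

#!/bin/bash
# Run up to 13 per-database imports concurrently.
# dbs.txt and import_db.sh are hypothetical placeholders.
xargs -a dbs.txt -n 1 -P 13 ./import_db.sh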


Cheers!!!!


Re: Need suggestions to speed up Sqoop imports

Posted by Juan Martin Pampliega <jp...@gmail.com>.
If you are using InnoDB you can use something like:

SELECT table_rows FROM INFORMATION_SCHEMA.TABLES
  WHERE table_schema = 'db_name' AND table_name LIKE 'name_of_table';
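
A sketch of wiring that into a pre-check before launching Sqoop (db_name
and name_of_table are placeholders, and for InnoDB table_rows is only an
optimizer estimate, so treat a nonzero value as "probably has data"
rather than an exact count):

#!/bin/bash
# Skip the Sqoop job when the row estimate says the table is empty.
# db_name / name_of_table are placeholders; -N suppresses the header row.
rows=$( mysql -N -e "SELECT table_rows FROM INFORMATION_SCHEMA.TABLES
  WHERE table_schema = 'db_name' AND table_name LIKE 'name_of_table'" )
[[ ${rows:-0} -gt 0 ]] && sqoop import ...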


Re: Need suggestions to speed up Sqoop imports

Posted by Abraham Elmahrek <ab...@cloudera.com>.
Hey Syed,

Sqoop has to boot an MR job in order to do the data transfer. This takes
some time. As such, would the following work?

#!/bin/bash

# Launch the Sqoop job only if the table actually has rows. "mysql test -e"
# runs the query against the `test` database, and `tail -1` drops the
# column header so only the count itself is compared.
[[ $( mysql test -e "SELECT COUNT(*) FROM test" | tail -1 ) -gt 0 ]] &&
  sqoop import ...
The COUNT statement should be lightning fast if you're using MyISAM as your
storage engine.

-Abe
