Posted to dev@sqoop.apache.org by "Markus Kemper (JIRA)" <ji...@apache.org> on 2017/02/06 15:08:41 UTC

[jira] [Commented] (SQOOP-3130) Sqoop (export + --export-dir + Avro files) is not obeying --num-mappers requested

    [ https://issues.apache.org/jira/browse/SQOOP-3130?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15854178#comment-15854178 ] 

Markus Kemper commented on SQOOP-3130:
--------------------------------------

From reviewing this issue further, another possible workaround is to set (mapreduce.input.fileinputformat.split.minsize) to help control the number of map tasks used when exporting Avro data files.  This method is unlikely to be fully deterministic when the volume of data exceeds the requested "minsize" multiplied by the requested number of map tasks; however, it is still preferable to one map task per Avro data file.  The examples below demonstrate this workaround.

*Use (mapreduce.input.fileinputformat.split.minsize) without \--num-mappers*
{noformat}
sqoop eval --connect $MYCONN --username $MYUSER --password $MYPSWD --query "delete from t1_text"
sqoop eval --connect $MYCONN --username $MYUSER --password $MYPSWD --query "select count(*) from t1_text"
sqoop export -Dmapreduce.input.fileinputformat.split.minsize=134217728 --connect $MYCONN --username $MYUSER --password $MYPSWD --table T1_TEXT --export-dir /user/root/external/t1_text --input-fields-terminated-by ','

17/02/06 05:56:25 INFO input.FileInputFormat: Total input paths to process : 10
17/02/06 05:56:25 INFO mapreduce.JobSubmitter: number of splits:4 <=========== uses default --num-mappers value (4)
<SNIP>
17/02/06 05:57:11 INFO mapreduce.ExportJobBase: Transferred 3.519 MB in 47.9893 seconds (75.0887 KB/sec)
17/02/06 05:57:11 INFO mapreduce.ExportJobBase: Exported 100000 records.
{noformat}
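
In this first run the 10 input files (about 3.5 MB in total) are far smaller than the 128 MB (134217728 byte) minsize, so they are grouped into 4 splits rather than 10, with the split count capped by the default \--num-mappers value of 4.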

*Use (mapreduce.input.fileinputformat.split.minsize) with \--num-mappers*
{noformat}
sqoop eval --connect $MYCONN --username $MYUSER --password $MYPSWD --query "delete from t1_text"
sqoop eval --connect $MYCONN --username $MYUSER --password $MYPSWD --query "select count(*) from t1_text"
sqoop export -Dmapreduce.input.fileinputformat.split.minsize=134217728 --connect $MYCONN --username $MYUSER --password $MYPSWD --table T1_TEXT --export-dir /user/root/external/t1_text --input-fields-terminated-by ',' --num-mappers 2

17/02/06 05:59:12 INFO input.FileInputFormat: Total input paths to process : 10
17/02/06 05:59:12 INFO mapreduce.JobSubmitter: number of splits:2 <=========== uses requested --num-mappers value (2) 
<SNIP>
17/02/06 05:59:42 INFO mapreduce.ExportJobBase: Transferred 3.5189 MB in 32.3481 seconds (111.3925 KB/sec)
17/02/06 05:59:42 INFO mapreduce.ExportJobBase: Exported 100000 records.
{noformat}
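
Rather than hard-coding a minsize, the value can be derived from the actual size of the export directory.  The sketch below is illustrative only; EXPORT_DIR, DESIRED_MAPPERS, TOTAL_BYTES and MINSIZE are hypothetical helper variables, not Sqoop options.  It relies on the first column of (hdfs dfs -du -s) being the total size in bytes, and on Hadoop's usual split sizing of max(minsize, min(maxsize, blocksize)): a minsize of at least total-bytes / desired-mappers should keep the split count at or below the requested mapper count whenever the input files can be combined.

*Sketch: derive (mapreduce.input.fileinputformat.split.minsize) from the data volume*
{noformat}
# Hypothetical helper, not part of Sqoop: size the splits so that roughly
# DESIRED_MAPPERS map tasks are produced for this export directory.
EXPORT_DIR=/user/root/external/t1_text
DESIRED_MAPPERS=2

# The first column of "hdfs dfs -du -s" is the total size in bytes.
TOTAL_BYTES=$(hdfs dfs -du -s $EXPORT_DIR | awk '{print $1}')

# Ceiling division: the smallest minsize with TOTAL_BYTES / MINSIZE <= DESIRED_MAPPERS.
MINSIZE=$(( (TOTAL_BYTES + DESIRED_MAPPERS - 1) / DESIRED_MAPPERS ))

sqoop export -Dmapreduce.input.fileinputformat.split.minsize=$MINSIZE \
  --connect $MYCONN --username $MYUSER --password $MYPSWD \
  --table T1_TEXT --export-dir $EXPORT_DIR \
  --input-fields-terminated-by ',' --num-mappers $DESIRED_MAPPERS
{noformat}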

> Sqoop (export + --export-dir + Avro files) is not obeying --num-mappers requested
> ---------------------------------------------------------------------------------
>
>                 Key: SQOOP-3130
>                 URL: https://issues.apache.org/jira/browse/SQOOP-3130
>             Project: Sqoop
>          Issue Type: Bug
>            Reporter: Markus Kemper
>
> When using Sqoop (export + --export-dir + Avro files), the requested --num-mappers value is not obeyed; instead, one map task is created per Avro data file.  
> One known workaround for this issue is to use the Sqoop --hcatalog options.
> Please see the test case below demonstrating the issue and workaround.
> *Test Case*
> {noformat}
> #################
> # STEP 01 - Create Data
> #################
> for i in {1..100000}; do d=`date +"%Y-%m-%d %H:%M:%S" --date="+$i days"`; echo "$i,$d,row data" >> ./data.csv; done
> ls -l ./*;
> wc data.csv 
> hdfs dfs -mkdir -p /user/root/external/t1
> hdfs dfs -put ./data.csv /user/root/external/t1/data.csv
> hdfs dfs -ls -R /user/root/external/t1/
> Output:
> -rw-r--r-- 1 root root 3488895 Feb  1 11:20 ./data.csv
> ~~~~~
> 100000  300000 3488895 data.csv
> ~~~~~
> -rw-r--r--   3 root root    3488895 2017-02-01 11:26 /user/root/external/t1/data.csv
> #################
> # STEP 02 - Create RDBMS Table and Export Data
> #################
> export MYCONN=jdbc:oracle:thin:@oracle.cloudera.com:1521/db11g;
> export MYUSER=sqoop
> export MYPSWD=cloudera
> sqoop eval --connect $MYCONN --username $MYUSER --password $MYPSWD --query "drop table t1_text"
> sqoop eval --connect $MYCONN --username $MYUSER --password $MYPSWD --query "create table t1_text (c1 int, c2 date, c3 varchar(10))"
> sqoop export --connect $MYCONN --username $MYUSER --password $MYPSWD --table T1_TEXT --export-dir /user/root/external/t1 --input-fields-terminated-by ',' --num-mappers 1
> sqoop eval --connect $MYCONN --username $MYUSER --password $MYPSWD --query "select count(*) from t1_text"
> Output:
> 17/02/01 11:33:31 INFO mapreduce.ExportJobBase: Transferred 3.3274 MB in 24.8037 seconds (137.3688 KB/sec)
> 17/02/01 11:33:31 INFO mapreduce.ExportJobBase: Exported 100000 records.
> ~~~~~~
> ------------------------
> | COUNT(*)             | 
> ------------------------
> | 100000               | 
> ------------------------
> #################
> # STEP 03 - Import Data as Text Creating 10 HDFS Files
> #################
> sqoop import --connect $MYCONN --username $MYUSER --password $MYPSWD --table T1_TEXT --target-dir /user/root/external/t1_text --delete-target-dir --num-mappers 10 --split-by C1 --as-textfile
> hdfs dfs -ls /user/root/external/t1_text/part*
> Output:
> 17/02/01 11:38:26 INFO mapreduce.ImportJobBase: Transferred 3.518 MB in 57.0517 seconds (63.1434 KB/sec)
> 17/02/01 11:38:26 INFO mapreduce.ImportJobBase: Retrieved 100000 records.
> ~~~~~
> -rw-r--r--   3 root root     358894 2017-02-01 11:38 /user/root/external/t1_text/part-m-00000
> -rw-r--r--   3 root root     370000 2017-02-01 11:38 /user/root/external/t1_text/part-m-00001
> -rw-r--r--   3 root root     370000 2017-02-01 11:38 /user/root/external/t1_text/part-m-00002
> -rw-r--r--   3 root root     370000 2017-02-01 11:38 /user/root/external/t1_text/part-m-00003
> -rw-r--r--   3 root root     370000 2017-02-01 11:38 /user/root/external/t1_text/part-m-00004
> -rw-r--r--   3 root root     370000 2017-02-01 11:38 /user/root/external/t1_text/part-m-00005
> -rw-r--r--   3 root root     370000 2017-02-01 11:38 /user/root/external/t1_text/part-m-00006
> -rw-r--r--   3 root root     370000 2017-02-01 11:38 /user/root/external/t1_text/part-m-00007
> -rw-r--r--   3 root root     370000 2017-02-01 11:38 /user/root/external/t1_text/part-m-00008
> -rw-r--r--   3 root root     370001 2017-02-01 11:38 /user/root/external/t1_text/part-m-00009
> #################
> # STEP 04 - Export the 10 Text-Formatted Data Files Using 2 Splits
> #################
> sqoop eval --connect $MYCONN --username $MYUSER --password $MYPSWD --query "delete from t1_text"
> sqoop eval --connect $MYCONN --username $MYUSER --password $MYPSWD --query "select count(*) from t1_text"
> sqoop export --connect $MYCONN --username $MYUSER --password $MYPSWD --table T1_TEXT --export-dir /user/root/external/t1_text --input-fields-terminated-by ',' --num-mappers 2
> sqoop eval --connect $MYCONN --username $MYUSER --password $MYPSWD --query "select count(*) from t1_text"
> Output:
> ------------------------
> | COUNT(*)             | 
> ------------------------
> | 0                    | 
> ------------------------
> ~~~~~
> 17/02/01 11:47:26 INFO input.FileInputFormat: Total input paths to process : 10
> 17/02/01 11:47:26 INFO mapreduce.JobSubmitter: number of splits:2
> <SNIP>
> 17/02/01 11:47:55 INFO mapreduce.ExportJobBase: Transferred 3.5189 MB in 31.7104 seconds (113.6324 KB/sec)
> 17/02/01 11:47:55 INFO mapreduce.ExportJobBase: Exported 100000 records.
> ~~~~~
> ------------------------
> | COUNT(*)             | 
> ------------------------
> | 100000               | 
> ------------------------
> #################
> # STEP 05 - Import Data as Avro Creating 10 HDFS Files
> #################
> sqoop import --connect $MYCONN --username $MYUSER --password $MYPSWD --table T1_TEXT --target-dir /user/root/external/t1_avro --delete-target-dir --num-mappers 10 --split-by C1 --as-avrodatafile
> hdfs dfs -ls /user/root/external/t1_avro/*.avro
> Output:
> 17/02/01 11:57:38 INFO mapreduce.ImportJobBase: Transferred 2.3703 MB in 68.454 seconds (35.4568 KB/sec)
> 17/02/01 11:57:38 INFO mapreduce.ImportJobBase: Retrieved 100000 records.
> ~~~~~
> -rw-r--r--   3 root root     231119 2017-02-01 11:57 /user/root/external/t1_avro/part-m-00000.avro
> -rw-r--r--   3 root root     250477 2017-02-01 11:57 /user/root/external/t1_avro/part-m-00001.avro
> -rw-r--r--   3 root root     250477 2017-02-01 11:57 /user/root/external/t1_avro/part-m-00002.avro
> -rw-r--r--   3 root root     250477 2017-02-01 11:57 /user/root/external/t1_avro/part-m-00003.avro
> -rw-r--r--   3 root root     250477 2017-02-01 11:57 /user/root/external/t1_avro/part-m-00004.avro
> -rw-r--r--   3 root root     250477 2017-02-01 11:57 /user/root/external/t1_avro/part-m-00005.avro
> -rw-r--r--   3 root root     250477 2017-02-01 11:57 /user/root/external/t1_avro/part-m-00006.avro
> -rw-r--r--   3 root root     250477 2017-02-01 11:57 /user/root/external/t1_avro/part-m-00007.avro
> -rw-r--r--   3 root root     250477 2017-02-01 11:57 /user/root/external/t1_avro/part-m-00008.avro
> -rw-r--r--   3 root root     250478 2017-02-01 11:56 /user/root/external/t1_avro/part-m-00009.avro
> #################
> # STEP 06 - Export the 10 Avro-Formatted Data Files Using 2 Splits (reproduction of issue)
> #################
> sqoop eval --connect $MYCONN --username $MYUSER --password $MYPSWD --query "delete from t1_text"
> sqoop eval --connect $MYCONN --username $MYUSER --password $MYPSWD --query "select count(*) from t1_text"
> sqoop export --connect $MYCONN --username $MYUSER --password $MYPSWD --table T1_TEXT --export-dir /user/root/external/t1_avro --input-fields-terminated-by ',' --num-mappers 2
> sqoop eval --connect $MYCONN --username $MYUSER --password $MYPSWD --query "select count(*) from t1_text"
> Output:
> 17/02/01 12:01:07 INFO input.FileInputFormat: Total input paths to process : 10
> 17/02/01 12:01:08 INFO mapreduce.JobSubmitter: number of splits:10 <================== not correct, should be 2, not 10
> <SNIP>
> 17/02/01 12:02:02 INFO mapreduce.ExportJobBase: Transferred 2.4497 MB in 57.1965 seconds (43.8575 KB/sec)
> 17/02/01 12:02:02 INFO mapreduce.ExportJobBase: Exported 100000 records.
> ~~~~~
> ------------------------
> | COUNT(*)             | 
> ------------------------
> | 100000               | 
> ------------------------
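> # NOTE: "number of splits:10" above means one map task per Avro data file.
> # A plain (non-combining) input format produces at least one split per
> # input file, so the requested --num-mappers 2 cannot take effect here.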
> #################
> # STEP 07 - Export the 10 Avro-Formatted Data Files Using 2 Splits via the HCat Options (workaround)
> #################
> sqoop eval --connect $MYCONN --username $MYUSER --password $MYPSWD --query "delete from t1_text"
> sqoop eval --connect $MYCONN --username $MYUSER --password $MYPSWD --query "select count(*) from t1_text"
> beeline -u "jdbc:hive2://hs2.coe.cloudera.com:10000" -n user1 -e "use default; drop table t1_avro; create external table t1_avro (c1 int, c2 string, c3 string) row format delimited fields terminated by ',' stored as avro location 'hdfs:///user/root/external/t1_avro'; select count(*) from t1_avro;"
> sqoop export --connect $MYCONN --username $MYUSER --password $MYPSWD --table T1_TEXT --hcatalog-database default --hcatalog-table t1_avro --num-mappers 2  
> sqoop eval --connect $MYCONN --username $MYUSER --password $MYPSWD --query "select count(*) from t1_text"
> Output:
> ------------------------
> | COUNT(*)             | 
> ------------------------
> | 0                    | 
> ------------------------
> ~~~~~
> +---------+--+
> |   _c0   |
> +---------+--+
> | 100000  |
> +---------+--+
> ~~~~~
> 17/02/01 13:41:54 INFO mapred.FileInputFormat: Total input paths to process : 10
> 17/02/01 13:41:54 INFO mapreduce.JobSubmitter: number of splits:2 <================ correct!
> <SNIP>
> 17/02/01 13:42:34 INFO mapreduce.ExportJobBase: Transferred 2.5225 MB in 48.7286 seconds (53.0082 KB/sec)
> 17/02/01 13:42:34 INFO mapreduce.ExportJobBase: Exported 100000 records.
> ~~~~~
> ------------------------
> | COUNT(*)             | 
> ------------------------
> | 100000               | 
> ------------------------
> {noformat}


