Posted to user@sqoop.apache.org by Reed Villanueva <rv...@ucera.org> on 2020/02/03 10:05:37 UTC

How does sqoop determine schema column types when using the --as-parquetfile option?

Sqoop version in use:

20/02/02 19:38:18 INFO sqoop.Sqoop: Running Sqoop version: 1.4.7.3.1.0.0-78
Sqoop 1.4.7.3.1.0.0-78
git commit id 3d9e99efab5d06beac8f5a9506ac2576619e15f9
Compiled by jenkins on Thu Dec  6 12:26:56 UTC 2018

I'm importing table data from an Oracle DB via sqoop, using the
--as-parquetfile option:

sqoop import \
          -Dmapreduce.map.memory.mb=3144 -Dmapreduce.map.java.opts=-Xmx1048m \
          -Dyarn.app.mapreduce.am.log.level=DEBUG \
          -Dmapreduce.map.log.level=DEBUG \
          -Dmapreduce.reduce.log.level=DEBUG \
          -Dmapred.job.name="Ora import table $tablename" \
          -Djava.security.egd=file:///dev/urandom \
          -Doraoop.timestamp.string=false \
          -Dmapreduce.map.max.attempts=10 \
          -Dmapreduce.task.timeout=1500000 \
          -Dorg.apache.sqoop.splitter.allow_text_splitter=true \
          --connect $DBCNXN --username $DBUSER --password $DBPASSWORD \
          --as-parquetfile \
          --target-dir $importdir \
          --query "$sqoop_query" \
          --split-by $splitby \
          --where "1=1" \
          --num-mappers 12 \
          --class-name "QueryResult_$tablename" \
          --delete-target-dir

I'm noticing that when importing the whole table, e.g....

    --query "select * from mytable"

versus just a smaller slice, e.g....

    --query "select * from mytable where post_date >= CURRENT_DATE"

some of the column types in the resulting Parquet schemas differ (when
placing both sets of Parquet files in one folder and trying to read them
via pyspark). In particular, the date columns in the bulk-imported version
come out as *string* types, while most of the date columns in the Parquet
files from the conditional query come out as *long*, e.g....
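If those longs turn out to be epoch milliseconds (Sqoop's Avro/Parquet path commonly represents date/timestamp values as time-since-epoch in millis), the two representations can at least be reconciled on read by casting one side. A minimal pure-Python sketch of that conversion; the UTC zone and the output format here are assumptions to adjust against what the bulk import actually produced:

```python
from datetime import datetime, timezone

def millis_to_timestamp_string(millis):
    """Render an epoch-millisecond long as a 'YYYY-MM-DD HH:MM:SS' string.

    UTC is assumed; adjust the zone and format to match the string form
    seen in the bulk-imported Parquet files.
    """
    dt = datetime.fromtimestamp(millis / 1000, tz=timezone.utc)
    return dt.strftime("%Y-%m-%d %H:%M:%S")

# Hypothetical SERVICE_DATE value stored as a long:
print(millis_to_timestamp_string(1580691600000))
```

The same cast can of course be applied on the Spark side instead (casting the long column to a timestamp/string before merging), which avoids rewriting any files.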

Caused by: org.apache.spark.SparkException: Failed to merge fields
'SERVICE_DATE' and 'SERVICE_DATE'. Failed to merge incompatible data types
StringType and LongType
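For narrowing this down, it may help to read each import directory separately and diff the two schemas before asking Spark to merge them. A small sketch with the schemas reduced to plain name-to-type dicts (e.g. via dict(spark.read.parquet(path).dtypes) in pyspark); the column names besides SERVICE_DATE are illustrative:

```python
def diff_schemas(a, b):
    """Return {column: (type_in_a, type_in_b)} for the columns present in
    both schemas whose declared types disagree."""
    return {col: (a[col], b[col])
            for col in a.keys() & b.keys()
            if a[col] != b[col]}

# Hypothetical schemas as read from the two import directories:
bulk_import = {"SERVICE_DATE": "string", "CLAIM_ID": "long"}
conditional_import = {"SERVICE_DATE": "long", "CLAIM_ID": "long"}

print(diff_schemas(bulk_import, conditional_import))
# -> {'SERVICE_DATE': ('string', 'long')}
```

Listing only the disagreeing columns makes it easier to see whether the divergence is limited to date columns or affects other types too.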

Does anyone know what could be going on here? Any further debugging advice?
