Posted to user@sqoop.apache.org by Joshua Baxter <jo...@gmail.com> on 2015/02/03 16:35:07 UTC
Re: --as-parquet-file, Oraoop and Decimal and Timestamp types
I've had a little more luck with this after upgrading to CDH 5.3. The
Oracle direct connector seems to be working well with HCatalog integration
and the various output file formats. However, it seems that Parquet doesn't
work with HCatalog integration. When using "stored as parquet" as the
--hcatalog-storage-stanza, all the mappers fail with the error below.
15/02/02 17:17:03 INFO mapreduce.Job: Task Id :
attempt_1422914679712_0003_m_000042_1, Status : FAILED
Error: java.lang.RuntimeException: Should never be used
at
org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat.getRecordWriter(MapredParquetOutputFormat.java:79)
at
org.apache.hive.hcatalog.mapreduce.FileOutputFormatContainer.getRecordWriter(FileOutputFormatContainer.java:103)
at
org.apache.hive.hcatalog.mapreduce.HCatOutputFormat.getRecordWriter(HCatOutputFormat.java:260)
at
org.apache.hadoop.mapred.MapTask$NewDirectOutputCollector.<init>(MapTask.java:644)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:341)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:168)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1642)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:163)
Has anyone had any luck sqooping directly to Parquet with Decimal and
Timestamp types?
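For reference, the HCatalog-based invocation being described would look
roughly like the sketch below; the database, table and HCatalog names are
placeholders, not values from this thread.

```shell
# Hypothetical sketch of an HCatalog-backed Parquet import of the kind
# described above. Connection, database and table names are placeholders.
sqoop import \
  --direct \
  --connect jdbc:oracle:thin:@//${DATABASE} \
  --table "${TABLE}" \
  --hcatalog-database default \
  --hcatalog-table fact_table_parquet \
  --create-hcatalog-table \
  --hcatalog-storage-stanza "stored as parquet" \
  --num-mappers 48
```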
On Tue, Dec 2, 2014 at 6:17 PM, Joshua Baxter <jo...@gmail.com>
wrote:
> I'm using Sqoop, Oraoop and the --as-parquet-file switch to pull down
> partitions of a large fact table and getting some great speed. There
> aren't any columns I can evenly split by with the default connector, but
> with Oraoop I can get evenly sized Parquet files that I can use directly
> in Impala and Hive without incurring remote reads. A couple of things I
> have noticed, though.
>
> - Decimal fields are getting exported as strings. SQOOP-1445 refers to
> this, but it sounds like a fix isn't planned due to the HCatalog support.
> Unfortunately the direct connectors, apart from Netezza, are not
> currently supported.
> - You need to use the option -Doraoop.timestamp.string=false, otherwise
> you get a Not in union ["long","null"]: 2014-07-24 00:00:00 exception due
> to the intermediary file format. However, the resulting Parquet value is
> a double rather than a Hive- or Impala-compatible timestamp.
>
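To make the numeric representation in the second point concrete: with
-Doraoop.timestamp.string=false the timestamp lands as a number, and if
that number is seconds since the Unix epoch it can be turned back into a
readable timestamp with GNU date. The epoch value below is chosen to match
the date in the exception; in practice the stored value may be in
milliseconds, in which case divide by 1000 first.

```shell
# Convert a hypothetical stored epoch value back to a timestamp string.
# Assumes GNU date and a value in whole seconds (divide by 1000 if the
# data is in milliseconds).
epoch=1406160000
date -u -d "@${epoch}" "+%Y-%m-%d %H:%M:%S"   # prints 2014-07-24 00:00:00
```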
> Here is what I am running now.
>
> sqoop import -Doraoop.chunk.method=ROWID -Doraoop.timestamp.string=false
> -Doraoop.import.partitions=${PARTITION} \
> --direct \
> --connect jdbc:oracle:thin:@//${DATABASE} \
> --table "${TABLE}" \
> --columns COL1,COL2,COL3,COL4,COL5,COL6 \
> --map-column-java COL1=Long,COL2=Long,COL3=Long,COL4=Long \
> --m 48 \
> --target-dir /user/joshba/LANDING_PAD/TABLE-${PARTITION}/ \
> --delete-target-dir
>
> COL1-4 are stored as NUMBER(38,0) but don't hold anything larger than a
> long, so I've remapped those to save space. COL5 is a Decimal and COL6 is
> a DATE. Is there any way I can remap these as well so that they are
> written into the Parquet file as DECIMAL- and timestamp-compatible types
> respectively, so there isn't a need to redefine these columns?
>
> Many Thanks
>
> Josh
>
Re: --as-parquet-file, Oraoop and Decimal and Timestamp types
Posted by Joshua Baxter <jo...@gmail.com>.
The version of Hive is 0.13.1. Timestamps are working fine in Hive with
Parquet, though; it looks like the fixes have been backported.
http://archive.cloudera.com/cdh5/cdh/5/hive-0.13.1-cdh5.3.0.releasenotes.html
On Wed, Feb 4, 2015 at 3:51 AM, Raviprasad N Pentakota <ra...@in.ibm.com>
wrote:
> Hi,
> There were known issues with previous versions of Hive with the Parquet
> and timestamp combination. Please check which version of Hive you are
> using in your cluster.
>
> Regards,
> Ravi Prasad Pentakota
> India Software Lab, IBM Software Group
> Phone: +9180-43328520 Mobile: 919620959477
> e-mail:rapentak@in.ibm.com
Re: --as-parquet-file, Oraoop and Decimal and Timestamp types
Posted by Raviprasad N Pentakota <ra...@in.ibm.com>.
Hi,
There were known issues with previous versions of Hive with the Parquet
and timestamp combination. Please check which version of Hive you are
using in your cluster.
Regards,
Ravi Prasad Pentakota
India Software Lab, IBM Software Group
Phone: +9180-43328520 Mobile: 919620959477
e-mail:rapentak@in.ibm.com