Posted to user@sqoop.apache.org by Joshua Baxter <jo...@gmail.com> on 2015/02/03 16:35:07 UTC

Re: --as-parquet-file, Oraoop and Decimal and Timestamp types

I've had a little more luck with this after upgrading to CDH 5.3. The
Oracle direct connector seems to be working well with HCatalog integration
and the various output file formats. However, it seems that Parquet doesn't
work with HCatalog integration. When using "stored as parquet" as the
--hcatalog-storage-stanza, all the mappers fail with the error below.


15/02/02 17:17:03 INFO mapreduce.Job: Task Id :
attempt_1422914679712_0003_m_000042_1, Status : FAILED
Error: java.lang.RuntimeException: Should never be used
        at
org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat.getRecordWriter(MapredParquetOutputFormat.java:79)
        at
org.apache.hive.hcatalog.mapreduce.FileOutputFormatContainer.getRecordWriter(FileOutputFormatContainer.java:103)
        at
org.apache.hive.hcatalog.mapreduce.HCatOutputFormat.getRecordWriter(HCatOutputFormat.java:260)
        at
org.apache.hadoop.mapred.MapTask$NewDirectOutputCollector.<init>(MapTask.java:644)
        at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:341)
        at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:168)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:415)
        at
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1642)
        at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:163)
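
For reference, the failing jobs are launched with something along these
lines; the HCatalog database and table names here are just placeholders
rather than the real ones:

sqoop import -Doraoop.chunk.method=ROWID -Doraoop.import.partitions=${PARTITION} \
--direct \
--connect jdbc:oracle:thin:@//${DATABASE} \
--table "${TABLE}" \
--hcatalog-database default \
--hcatalog-table ${TABLE}_PQ \
--create-hcatalog-table \
--hcatalog-storage-stanza "stored as parquet" \
--m 48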


Has anyone had any luck sqooping directly to Parquet with Decimal and
Timestamp types?
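
For what it's worth, this is the shape of command I was hoping would work,
i.e. the job from the quoted message below with --as-parquet-file spelled
out and COL5 and COL6 added to the --map-column-java list. The
java.math.BigDecimal and java.sql.Timestamp mappings are only a guess on my
part, and I haven't been able to confirm that Sqoop honours them when
writing Parquet:

sqoop import -Doraoop.chunk.method=ROWID -Doraoop.timestamp.string=false \
-Doraoop.import.partitions=${PARTITION} \
--direct \
--connect jdbc:oracle:thin:@//${DATABASE} \
--table "${TABLE}" \
--columns COL1,COL2,COL3,COL4,COL5,COL6 \
--map-column-java COL1=Long,COL2=Long,COL3=Long,COL4=Long,COL5=java.math.BigDecimal,COL6=java.sql.Timestamp \
--as-parquet-file \
--m 48 \
--target-dir /user/joshba/LANDING_PAD/TABLE-${PARTITION}/ \
--delete-target-dir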

On Tue, Dec 2, 2014 at 6:17 PM, Joshua Baxter <jo...@gmail.com>
wrote:

> I'm using Sqoop, Oraoop and the --as-parquet-file switch to pull down
> partitions of a large fact table and getting some great speed. There
> aren't any columns I can evenly split by with the default connector, but
> with Oraoop I can get evenly sized Parquet files that I can use directly
> in Impala and Hive without incurring remote reads. A couple of things I
> have noticed, though.
>
>    - Decimal fields are getting exported as strings. SQOOP-1445 refers to
>    this, but it sounds like a fix isn't planned because HCatalog support
>    covers it. Unfortunately the direct connectors, apart from Netezza, are
>    not currently supported.
>    - You need to use the option -Doraoop.timestamp.string=false, otherwise
>    you get a Not in union ["long","null"]: 2014-07-24 00:00:00 exception
>    due to the intermediary file format. However, the resulting column in
>    the Parquet file is a double rather than a Hive- or Impala-compatible
>    timestamp.
>
> Here is what I am running now.
>
> sqoop import  -Doraoop.chunk.method=ROWID -Doraoop.timestamp.string=false
> -Doraoop.import.partitions=${PARTITION} \
> --direct \
> --connect jdbc:oracle:thin:@//${DATABASE}  \
> --table "${TABLE}" \
> --columns COL1,COL2,COL3,COL4,COL5,COL6 \
> --map-column-java  COL1=Long,COL2=Long,COL3=Long,COL4=Long \
> --m 48 \
> --target-dir /user/joshba/LANDING_PAD/TABLE-${PARTITION}/ \
> --delete-target-dir
>
> COL1-4 are stored as NUMBER(38,0) but don't hold anything larger than a
> long, so I've remapped those to save space. COL5 is a DECIMAL and COL6 is
> a DATE. Is there any way I can remap these as well, so that they are
> written into the Parquet file as DECIMAL- and timestamp-compatible types
> respectively and there isn't a need to redefine these columns?
>
> Many Thanks
>
> Josh
>

Re: --as-parquet-file, Oraoop and Decimal and Timestamp types

Posted by Joshua Baxter <jo...@gmail.com>.
The version of Hive is 0.13.1. Timestamps are working fine in Hive with
Parquet, though. It looks like the fixes have been backported.

http://archive.cloudera.com/cdh5/cdh/5/hive-0.13.1-cdh5.3.0.releasenotes.html
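
A quick sanity check along these lines works here; ts_check is a throwaway
table name and src_table is just any small existing table:

hive -e "CREATE TABLE ts_check (ts TIMESTAMP) STORED AS PARQUET"
hive -e "INSERT INTO TABLE ts_check SELECT CAST('2014-07-24 00:00:00' AS TIMESTAMP) FROM src_table LIMIT 1"
hive -e "SELECT * FROM ts_check"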


On Wed, Feb 4, 2015 at 3:51 AM, Raviprasad N Pentakota <ra...@in.ibm.com>
wrote:

> Hi,
> There were known issues with previous versions of Hive and the Parquet and
> timestamp combination. Please check which version of Hive you are using in
> your cluster.

Re: --as-parquet-file, Oraoop and Decimal and Timestamp types

Posted by Raviprasad N Pentakota <ra...@in.ibm.com>.
Hi,
There were known issues with previous versions of Hive and the Parquet and
timestamp combination. Please check which version of Hive you are using in
your cluster.
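
For example, running something like the following on one of the cluster
nodes will show which Hive build is installed:

hive --version
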
Regards,
Ravi Prasad Pentakota
India Software Lab, IBM Software Group
Phone: +9180-43328520  Mobile: 919620959477
e-mail: rapentak@in.ibm.com




