Posted to dev@hive.apache.org by Remus Rusanu <re...@microsoft.com> on 2014/02/05 15:27:06 UTC
How is the STORED AS PARQUET used?
Hello all,
I tried the following on a build that has the latest HIVE-5783 patch applied over trunk:
hive> set hive.aux.jars.path=file:///usr/lib/hcatalog/share/hcatalog/hcatalog-core.jar,file:///usr/lib/hive/lib/parquet-hadoop-bundle-1.3.2.jar;
hive> create table alltypes_parquet stored as parquet as select cint, ctinyint, csmallint, cdouble, cfloat, cstring1 from alltypesorc;
hive> show create table alltypes_parquet;
OK
CREATE TABLE `alltypes_parquet`(
`cint` int COMMENT 'from deserializer',
`ctinyint` tinyint COMMENT 'from deserializer',
`csmallint` smallint COMMENT 'from deserializer',
`cdouble` double COMMENT 'from deserializer',
`cfloat` float COMMENT 'from deserializer',
`cstring1` string COMMENT 'from deserializer')
ROW FORMAT SERDE
'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
STORED AS INPUTFORMAT
'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION
'hdfs://sandbox.hortonworks.com:8020/apps/hive/warehouse/alltypes_parquet'
TBLPROPERTIES (
'numFiles'='1',
'transient_lastDdlTime'='1391609238',
'COLUMN_STATS_ACCURATE'='true',
'totalSize'='256959',
'numRows'='12288',
'rawDataSize'='73728')
Time taken: 0.256 seconds, Fetched: 22 row(s)
hive> select * from alltypes_parquet where 1=1;
...
Error:
Caused by: parquet.io.InvalidRecordException: cint not found in message table_schema {
}
at parquet.schema.GroupType.getFieldIndex(GroupType.java:104)
at parquet.schema.GroupType.getType(GroupType.java:136)
at org.apache.hadoop.hive.ql.io.parquet.read.DataWritableReadSupport.init(DataWritableReadSupport.java:93)
at org.apache.hadoop.hive.ql.io.parquet.read.ParquetRecordReaderWrapper.getSplit(ParquetRecordReaderWrapper.java:205)
at org.apache.hadoop.hive.ql.io.parquet.read.ParquetRecordReaderWrapper.<init>(ParquetRecordReaderWrapper.java:79)
at org.apache.hadoop.hive.ql.io.parquet.read.ParquetRecordReaderWrapper.<init>(ParquetRecordReaderWrapper.java:66)
at org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat.getRecordReader(MapredParquetInputFormat.java:51)
at org.apache.hadoop.hive.ql.io.CombineHiveRecordReader.<init>(CombineHiveRecordReader.java:65)
So what am I missing? The catalog info seems at odds with the record structure after CREATE TABLE.
Thanks,
~Remus
PS. alltypesorc is the test ORC table based on data from <enlistment>\data\files\alltypesorc
RE: How is the STORED AS PARQUET used?
Posted by Remus Rusanu <re...@microsoft.com>.
Thanks. I used INSERT ... SELECT instead and it works fine.
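For reference, the workaround looked roughly like this (a sketch; the column list is taken from the CTAS above, the exact DDL may differ):

hive> create table alltypes_parquet (
        cint int, ctinyint tinyint, csmallint smallint,
        cdouble double, cfloat float, cstring1 string)
      stored as parquet;
hive> insert overwrite table alltypes_parquet
        select cint, ctinyint, csmallint, cdouble, cfloat, cstring1 from alltypesorc;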
-----Original Message-----
From: Brock Noland [mailto:brock@cloudera.com]
Sent: Wednesday, February 05, 2014 4:46 PM
To: dev@hive.apache.org
Subject: Re: How is the STORED AS PARQUET used?
Hi,
CTAS needs to be implemented for Parquet + Hive. There are more details here: https://issues.apache.org/jira/browse/HIVE-6375
For a basic guide, I'd look at the following files in the patch:
parquet_partitioned.q and parquet_create.q
I have work on the Parquet documentation on my calendar for Thursday/Friday.
Brock
On Wed, Feb 5, 2014 at 8:27 AM, Remus Rusanu <re...@microsoft.com> wrote:
> Hello all,
>
> I tried the following on a build that has the latest HIVE-5783 patch applied over trunk:
>
> hive> set hive.aux.jars.path=file:///usr/lib/hcatalog/share/hcatalog/hcatalog-core.jar,file:///usr/lib/hive/lib/parquet-hadoop-bundle-1.3.2.jar;
> hive> create table alltypes_parquet stored as parquet as select cint, ctinyint, csmallint, cdouble, cfloat, cstring1 from alltypesorc;
> hive> show create table alltypes_parquet;
> OK
> CREATE TABLE `alltypes_parquet`(
> `cint` int COMMENT 'from deserializer',
> `ctinyint` tinyint COMMENT 'from deserializer',
> `csmallint` smallint COMMENT 'from deserializer',
> `cdouble` double COMMENT 'from deserializer',
> `cfloat` float COMMENT 'from deserializer',
> `cstring1` string COMMENT 'from deserializer')
> ROW FORMAT SERDE
> 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
> STORED AS INPUTFORMAT
> 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
> OUTPUTFORMAT
> 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
> LOCATION
> 'hdfs://sandbox.hortonworks.com:8020/apps/hive/warehouse/alltypes_parquet'
> TBLPROPERTIES (
> 'numFiles'='1',
> 'transient_lastDdlTime'='1391609238',
> 'COLUMN_STATS_ACCURATE'='true',
> 'totalSize'='256959',
> 'numRows'='12288',
> 'rawDataSize'='73728')
> Time taken: 0.256 seconds, Fetched: 22 row(s)
>
> hive> select * from alltypes_parquet where 1=1;
> ...
> Error:
> Caused by: parquet.io.InvalidRecordException: cint not found in message table_schema { }
> at parquet.schema.GroupType.getFieldIndex(GroupType.java:104)
> at parquet.schema.GroupType.getType(GroupType.java:136)
> at org.apache.hadoop.hive.ql.io.parquet.read.DataWritableReadSupport.init(DataWritableReadSupport.java:93)
> at org.apache.hadoop.hive.ql.io.parquet.read.ParquetRecordReaderWrapper.getSplit(ParquetRecordReaderWrapper.java:205)
> at org.apache.hadoop.hive.ql.io.parquet.read.ParquetRecordReaderWrapper.<init>(ParquetRecordReaderWrapper.java:79)
> at org.apache.hadoop.hive.ql.io.parquet.read.ParquetRecordReaderWrapper.<init>(ParquetRecordReaderWrapper.java:66)
> at org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat.getRecordReader(MapredParquetInputFormat.java:51)
> at org.apache.hadoop.hive.ql.io.CombineHiveRecordReader.<init>(CombineHiveRecordReader.java:65)
>
> So what am I missing? The catalog info seems at odds with the record structure after CREATE TABLE.
>
> Thanks,
> ~Remus
>
> PS. alltypesorc is the test ORC table based on data from <enlistment>\data\files\alltypesorc
--
Apache MRUnit - Unit testing MapReduce - http://mrunit.apache.org
Re: How is the STORED AS PARQUET used?
Posted by Brock Noland <br...@cloudera.com>.
Hi,
CTAS needs to be implemented for Parquet + Hive. There are more
details here: https://issues.apache.org/jira/browse/HIVE-6375
For a basic guide, I'd look at the following files in the patch:
parquet_partitioned.q and parquet_create.q
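For example, something along these lines works today without CTAS (a rough sketch, not the actual contents of those files; `staging` is a hypothetical source table):

create table parquet_example (id int, name string) stored as parquet;
insert overwrite table parquet_example select id, name from staging;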
I have work on the Parquet documentation on my calendar for Thursday/Friday.
Brock
On Wed, Feb 5, 2014 at 8:27 AM, Remus Rusanu <re...@microsoft.com> wrote:
> Hello all,
>
> I tried the following on a build that has the latest HIVE-5783 patch applied over trunk:
>
> hive> set hive.aux.jars.path=file:///usr/lib/hcatalog/share/hcatalog/hcatalog-core.jar,file:///usr/lib/hive/lib/parquet-hadoop-bundle-1.3.2.jar;
> hive> create table alltypes_parquet stored as parquet as select cint, ctinyint, csmallint, cdouble, cfloat, cstring1 from alltypesorc;
> hive> show create table alltypes_parquet;
> OK
> CREATE TABLE `alltypes_parquet`(
> `cint` int COMMENT 'from deserializer',
> `ctinyint` tinyint COMMENT 'from deserializer',
> `csmallint` smallint COMMENT 'from deserializer',
> `cdouble` double COMMENT 'from deserializer',
> `cfloat` float COMMENT 'from deserializer',
> `cstring1` string COMMENT 'from deserializer')
> ROW FORMAT SERDE
> 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
> STORED AS INPUTFORMAT
> 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
> OUTPUTFORMAT
> 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
> LOCATION
> 'hdfs://sandbox.hortonworks.com:8020/apps/hive/warehouse/alltypes_parquet'
> TBLPROPERTIES (
> 'numFiles'='1',
> 'transient_lastDdlTime'='1391609238',
> 'COLUMN_STATS_ACCURATE'='true',
> 'totalSize'='256959',
> 'numRows'='12288',
> 'rawDataSize'='73728')
> Time taken: 0.256 seconds, Fetched: 22 row(s)
>
> hive> select * from alltypes_parquet where 1=1;
> ...
> Error:
> Caused by: parquet.io.InvalidRecordException: cint not found in message table_schema {
> }
> at parquet.schema.GroupType.getFieldIndex(GroupType.java:104)
> at parquet.schema.GroupType.getType(GroupType.java:136)
> at org.apache.hadoop.hive.ql.io.parquet.read.DataWritableReadSupport.init(DataWritableReadSupport.java:93)
> at org.apache.hadoop.hive.ql.io.parquet.read.ParquetRecordReaderWrapper.getSplit(ParquetRecordReaderWrapper.java:205)
> at org.apache.hadoop.hive.ql.io.parquet.read.ParquetRecordReaderWrapper.<init>(ParquetRecordReaderWrapper.java:79)
> at org.apache.hadoop.hive.ql.io.parquet.read.ParquetRecordReaderWrapper.<init>(ParquetRecordReaderWrapper.java:66)
> at org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat.getRecordReader(MapredParquetInputFormat.java:51)
> at org.apache.hadoop.hive.ql.io.CombineHiveRecordReader.<init>(CombineHiveRecordReader.java:65)
>
> So what am I missing? The catalog info seems at odds with the record structure after CREATE TABLE.
>
> Thanks,
> ~Remus
>
> PS. alltypesorc is the test ORC table based on data from <enlistment>\data\files\alltypesorc
--
Apache MRUnit - Unit testing MapReduce - http://mrunit.apache.org