Posted to dev@parquet.apache.org by Santlal J Gupta <Sa...@bitwiseglobal.com> on 2015/07/17 08:54:36 UTC

Issue while reading Parquet file in Hive

Hello,

I have the following issue.

I have created a parquet file through cascading parquet and want to load it into a hive table. The parquet file loads successfully, but when I try to read it, Hive returns null instead of the actual data. Please find the code below.

package com.parquet.TimestampTest;

import cascading.flow.FlowDef;
import cascading.flow.hadoop.HadoopFlowConnector;
import cascading.pipe.Pipe;
import cascading.scheme.Scheme;
import cascading.scheme.hadoop.TextDelimited;
import cascading.tap.SinkMode;
import cascading.tap.Tap;
import cascading.tap.hadoop.Hfs;
import cascading.tuple.Fields;
import parquet.cascading.ParquetTupleScheme;

public class GenrateTimeStampParquetFile {
     static String inputPath = "target/input/timestampInputFile";
     static String outputPath = "target/parquetOutput/TimestampOutput";

     public static void main(String[] args) {

           write();
     }

     private static void write() {
           // Read the delimited text input and write it back out as Parquet.

           Fields field = new Fields("timestampField").applyTypes(String.class);
           Scheme sourceSch = new TextDelimited(field, true, "\n");

           Fields outputField = new Fields("timestampField");

           Scheme sinkSch = new ParquetTupleScheme(field, outputField,
                     "message TimeStampTest{optional binary timestampField ;}");

           Tap source = new Hfs(sourceSch, inputPath);
           Tap sink = new Hfs(sinkSch, outputPath, SinkMode.REPLACE);

           Pipe pipe = new Pipe("Hive timestamp");

           FlowDef fd = FlowDef.flowDef().addSource(pipe, source).addTailSink(pipe, sink);

           new HadoopFlowConnector().connect(fd).complete();
     }
}

Input file:

timestampInputFile

timestampField
1988-05-25 15:15:15.254
1987-05-06 14:14:25.362

After running the code following files are generated.
Output :
1. part-00000-m-00000.parquet
2. _SUCCESS
3. _metadata
4. _common_metadata

I have created a table in hive to load the part-00000-m-00000.parquet file.
The file loads successfully, but reading it gives null values.

I have used the following commands.

hive> create table timestampTest (timestampField timestamp);

hive> load data local inpath '/home/hduser/parquet_testing/part-00000-m-00000.parquet' into table timestampTest;
Loading data to table parquet_timestamp_test.timestamptest
Table parquet_timestamp_test.timestamptest stats: [numFiles=1, totalSize=296]
OK
Time taken: 0.508 seconds

hive> select * from timestamptest;
OK
NULL
NULL
NULL
Time taken: 0.104 seconds, Fetched: 3 row(s)

**************************************Disclaimer****************************************** This e-mail message and any attachments may contain confidential information and is for the sole use of the intended recipient(s) only. Any views or opinions presented or implied are solely those of the author and do not necessarily represent the views of BitWise. If you are not the intended recipient(s), you are hereby notified that disclosure, printing, copying, forwarding, distribution, or the taking of any action whatsoever in reliance on the contents of this electronic information is strictly prohibited. If you have received this e-mail message in error, please immediately notify the sender and delete the electronic message and any attachments.BitWise does not accept liability for any virus introduced by this e-mail or any attachments. ********************************************************************************************

Re: Issue while reading Parquet file in Hive

Posted by Daniel Weeks <dw...@netflix.com.INVALID>.
Santlal,

Someone more familiar with cascading might be able to address options
for writing timestamps, but you're not going to be able to simply write
binary and expect Hive to interpret it as a timestamp currently.  The
underlying storage for timestamp is int96, and that's what the serde
expects.
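As a sketch of what that int96 layout looks like (this is the Impala-style encoding used by parquet-mr's NanoTime class; the class and field names below are illustrative, and you should verify the layout against your parquet version): each value is 12 bytes, little-endian, with the first 8 bytes holding nanoseconds within the day and the last 4 bytes the Julian day number.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.time.LocalDateTime;

public class Int96TimestampSketch {
    // Julian day number of the Unix epoch, 1970-01-01.
    static final long JULIAN_EPOCH_DAY = 2440588L;

    // Encode a timestamp as the 12-byte int96 value Hive/Impala expect:
    // 8 bytes nanos-of-day, then 4 bytes Julian day, both little-endian.
    static byte[] toInt96(LocalDateTime ts) {
        int julianDay = (int) (ts.toLocalDate().toEpochDay() + JULIAN_EPOCH_DAY);
        long nanosOfDay = ts.toLocalTime().toNanoOfDay();
        return ByteBuffer.allocate(12)
                .order(ByteOrder.LITTLE_ENDIAN)
                .putLong(nanosOfDay)
                .putInt(julianDay)
                .array();
    }

    public static void main(String[] args) {
        byte[] encoded = toInt96(LocalDateTime.parse("1988-05-25T15:15:15.254"));
        System.out.println(encoded.length); // 12 bytes per value
    }
}
```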

You might be able to write it as a string in the proper format (yyyy-mm-dd
hh:mm:ss.f) and use the to_utc_timestamp function to get a timestamp value
out of it, but you'd have to do that every time you query it, or ETL from
the cascading output into a hive table with timestamp columns.
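A sketch of that workaround (table and column names here are hypothetical, not from the thread):

```sql
-- Declare the field as string in the parquet-backed staging table,
-- matching the binary/string field written by cascading:
CREATE TABLE ts_staging (timestampField STRING) STORED AS PARQUET;

-- Then convert at query time, or once during ETL into a table
-- with a real timestamp column:
CREATE TABLE ts_final STORED AS PARQUET AS
SELECT CAST(timestampField AS TIMESTAMP) AS timestampField
FROM ts_staging;
```

to_utc_timestamp(timestampField, 'UTC') could stand in for the CAST if a timezone adjustment is also needed.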

-Dan






RE: Issue while reading Parquet file in Hive

Posted by Santlal J Gupta <Sa...@bitwiseglobal.com>.
Hi Daniel,

I am a beginner with cascading parquet.
As per your guideline, I created the table with the following commands.

hive> create table test3(timestampField timestamp) stored as parquet;
hive> load data local inpath '/home/hduser/parquet_testing/part-00000-m-00000.parquet' into table test3;
hive> select  * from test3;

After running the above commands, I got the following output.

Output : 

OK
SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
SLF4J: Defaulting to no-operation (NOP) logger implementation
SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details.
Failed with exception java.io.IOException:org.apache.hadoop.hive.ql.metadata.HiveException: java.lang.ClassCastException: org.apache.hadoop.io.BytesWritable cannot be cast to org.apache.hadoop.hive.serde2.io.TimestampWritable

What I actually want is to create the parquet file through cascading parquet and load it into hive.
My parquet file contains data that I want to store in a hive column of type timestamp.
But cascading parquet has no timestamp datatype, so I declared the field as binary, and then tried to load it into a table whose field is of type timestamp.
That is when I got the above exception.

Please help me solve this problem.

Currently I am using:
    Hive 1.1.0-cdh5.4.2
    Cascading 2.5.1
    parquet-format-2.2.0


Thanks,
Santlal Gupta



Re: Issue while reading Parquet file in Hive

Posted by Daniel Weeks <dw...@netflix.com.INVALID>.
Santlal,

It might just be as simple as the storage format for your hive table.  I
notice you say:

hive> create table timestampTest (timestampField timestamp);

But this should be:

hive> create table timestampTest (timestampField timestamp) stored as parquet;

Hive is probably processing the file as text.  Please run 'hive> desc
formatted timestampTest;' and verify that the input/output/serde for the
table is actually parquet.
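For a parquet-backed table, the relevant lines of that output should name the parquet serde and input/output formats, roughly as follows (class names from memory; exact output varies by Hive version):

```sql
-- Look for lines like these in the output:
--   SerDe Library:  org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe
--   InputFormat:    org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat
--   OutputFormat:   org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat
DESC FORMATTED timestampTest;
```

If the serde line names LazySimpleSerDe instead, the table was created as text, so the parquet file is being parsed as plain text, which yields the NULLs shown above.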

-Dan
