Posted to dev@parquet.apache.org by Ravi Tatapudi <ra...@in.ibm.com> on 2016/03/09 14:09:06 UTC
How to write "date, timestamp, decimal" data to Parquet-files
Hello,
I am Ravi Tatapudi, from IBM India. I am working on a simple test tool
that writes data to Parquet files, which can then be imported into Hive
tables. Please find attached a sample program that writes a simple
Parquet data file.
Using the above program, I could create Parquet files with the data types
INT, LONG, STRING, BOOLEAN, etc. (basically all data types supported by
"org.apache.avro.Schema.Type") and load them into Hive tables
successfully.
Now, I am trying to figure out how to write date, timestamp, and decimal
data into Parquet files. Could you please let me know the possible
options (and/or a sample program, if any) in this regard?
Thanks,
Ravi
Re: How to write "date, timestamp, decimal" data to Parquet-files
Posted by Ravi Tatapudi <ra...@in.ibm.com>.
Hello Ryan:
Many thanks for the inputs and confirmation.
Do you have any idea when parquet-avro 1.9.0 will be released (any
tentative release date, month, or quarter, e.g. Q2/Q3 2016)? Could you
please let me know, so that I can plan accordingly.
Thanks,
Ravi
From: Ryan Blue <rb...@netflix.com.INVALID>
To: Parquet Dev <de...@parquet.apache.org>
Cc: Nagesh R Charka/India/IBM@IBMIN, Srinivas
Mudigonda/India/IBM@IBMIN
Date: 03/14/2016 09:56 PM
Subject: Re: How to write "date, timestamp, decimal" data to
Parquet-files
Ravi,
Support for those types in parquet-avro hasn't been committed yet. It's
implemented in the branch I pointed you to. If you want to use released
versions, it should be out in 1.9.0.
rb
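[Editor's note: declaring these types on an Avro schema goes through the
LogicalTypes API mentioned above. A minimal sketch, assuming Avro 1.8.x;
the fixed-type name and the precision/scale here are illustrative, not
from the thread:]

```java
import org.apache.avro.LogicalTypes;
import org.apache.avro.Schema;

// date: an INT counting days since the Unix epoch (1970-01-01)
Schema date = LogicalTypes.date().addToSchema(Schema.create(Schema.Type.INT));

// timestamp-millis: a LONG counting milliseconds since the Unix epoch
Schema ts = LogicalTypes.timestampMillis().addToSchema(Schema.create(Schema.Type.LONG));

// decimal(9, 2) backed by a fixed-length binary, which is what Hive expects;
// 4 bytes is enough to hold 9 digits of unscaled precision
Schema dec = LogicalTypes.decimal(9, 2)
        .addToSchema(Schema.createFixed("dec_9_2", null, null, 4));
```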
On Sun, Mar 13, 2016 at 9:52 PM, Ravi Tatapudi <ra...@in.ibm.com>
wrote:
> Hello Ryan:
>
> Thanks for the inputs.
>
> I am building & running the test-application, primarily using the
> following JAR-files (for Avro, Parquet-Avro & Hive APIs):
>
> 1) avro-1.8.0.jar
> 2) parquet-avro-1.6.0.jar (This is the latest one, found in the
> maven-repository-URL:
> http://mvnrepository.com/artifact/com.twitter/parquet-avro/1.6.0)
> 3) hive-exec-1.2.1.jar
>
> Am I supposed to build/run the test using a different version of the
> JAR files? Could you please let me know.
>
> Thanks,
> Ravi
>
>
>
>
> From: Ryan Blue <rb...@netflix.com.INVALID>
> To: Parquet Dev <de...@parquet.apache.org>
> Cc: Nagesh R Charka/India/IBM@IBMIN, Srinivas
> Mudigonda/India/IBM@IBMIN
> Date: 03/11/2016 10:54 PM
> Subject: Re: How to write "date, timestamp, decimal" data to
> Parquet-files
>
>
>
> Yes, it is supported in 1.2.1. It went in here:
>
> https://github.com/apache/hive/commit/912b4897ed457cfc447995b124ae84078287530b
>
> Are you using a version of Parquet with that pull request in it? Also, if
> you're using CDH this may not work.
>
> rb
>
> On Fri, Mar 11, 2016 at 12:40 AM, Ravi Tatapudi <ra...@in.ibm.com>
> wrote:
>
> > Hello Ryan:
> >
> > I am using hive-version: 1.2.1, as indicated below:
> >
> > --------------------------------------
> > $ hive --version
> > Hive 1.2.1
> > Subversion git://localhost.localdomain/home/sush/dev/hive.git -r
> > 243e7c1ac39cb7ac8b65c5bc6988f5cc3162f558
> > Compiled by sush on Fri Jun 19 02:03:48 PDT 2015
> > From source with checksum ab480aca41b24a9c3751b8c023338231
> > $
> > --------------------------------------
> >
> > As I understand, this version of Hive supports the "date" datatype,
> > right? Do you want me to re-test using any other, higher version of
> > Hive? Please let me know your thoughts.
> >
> > Thanks,
> > Ravi
> >
> >
> >
> > From: Ryan Blue <rb...@netflix.com.INVALID>
> > To: Parquet Dev <de...@parquet.apache.org>
> > Cc: Nagesh R Charka/India/IBM@IBMIN, Srinivas
> > Mudigonda/India/IBM@IBMIN
> > Date: 03/11/2016 06:18 AM
> > Subject: Re: How to write "date, timestamp, decimal" data to
> > Parquet-files
> >
> >
> >
> > What version of Hive are you using? You should make sure date is
> > supported there.
> >
> > rb
> >
> > On Thu, Mar 10, 2016 at 3:11 AM, Ravi Tatapudi <ra...@in.ibm.com>
> > wrote:
> >
> > > Hello Ryan:
> > >
> > > Many thanks for the reply. I see that the text attachment containing
> > > my test program was not sent to the mailing list, but got filtered
> > > out. Hence, I am copying the program code below:
> > >
> > > =================================================================
> > > import java.io.IOException;
> > > import java.util.*;
> > > import org.apache.hadoop.conf.Configuration;
> > > import org.apache.hadoop.fs.FileSystem;
> > > import org.apache.hadoop.fs.Path;
> > > import org.apache.avro.Schema;
> > > import org.apache.avro.Schema.Type;
> > > import org.apache.avro.Schema.Field;
> > > import org.apache.avro.generic.*;
> > > import org.apache.avro.LogicalTypes;
> > > import org.apache.avro.LogicalTypes.*;
> > > import org.apache.hadoop.hive.common.type.HiveDecimal;
> > > import parquet.avro.*;
> > >
> > > public class pqtw {
> > >
> > >     public static Schema makeSchema() {
> > >         List<Field> fields = new ArrayList<Field>();
> > >         fields.add(new Field("name", Schema.create(Type.STRING), null, null));
> > >         fields.add(new Field("age", Schema.create(Type.INT), null, null));
> > >
> > >         // "date" logical type backed by INT: days since the Unix epoch
> > >         Schema date = LogicalTypes.date().addToSchema(Schema.create(Type.INT));
> > >         fields.add(new Field("doj", date, null, null));
> > >
> > >         Schema schema = Schema.createRecord("filecc", null, "parquet", false);
> > >         schema.setFields(fields);
> > >         return schema;
> > >     }
> > >
> > >     public static GenericData.Record makeRecord(Schema schema, String name,
> > >             int age, int doj) {
> > >         GenericData.Record record = new GenericData.Record(schema);
> > >         record.put("name", name);
> > >         record.put("age", age);
> > >         record.put("doj", doj);
> > >         return record;
> > >     }
> > >
> > >     public static void main(String[] args) throws IOException,
> > >             InterruptedException, ClassNotFoundException {
> > >         String pqfile = "/tmp/pqtfile1";
> > >         try {
> > >             Configuration conf = new Configuration();
> > >             FileSystem fs = FileSystem.getLocal(conf);
> > >
> > >             Schema schema = makeSchema();
> > >             GenericData.Record rec = makeRecord(schema, "abcd", 21, 15000);
> > >             AvroParquetWriter writer = new AvroParquetWriter(new Path(pqfile), schema);
> > >             writer.write(rec);
> > >             writer.close();
> > >         } catch (Exception e) {
> > >             e.printStackTrace();
> > >         }
> > >     }
> > > }
> > > =================================================================
> > >
> > > With the above logic, I could write the data to a parquet file.
> > > However, when I load it into a hive table and select columns, I could
> > > select the columns "name" and "age" (i.e., the VARCHAR and INT
> > > columns) successfully, but selecting the "date" column failed with
> > > the error given below:
> > >
> > > --------------------------------------------------------------------------------
> > > hive> CREATE TABLE PT1 (name varchar(10), age int, doj date) STORED AS
> > > PARQUET ;
> > > OK
> > > Time taken: 0.369 seconds
> > > hive> load data local inpath '/tmp/pqtfile1' into table PT1;
> > > hive> SELECT name,age from PT1;
> > > OK
> > > abcd 21
> > > Time taken: 0.311 seconds, Fetched: 1 row(s)
> > > hive> SELECT doj from PT1;
> > > OK
> > > Failed with exception
> > > java.io.IOException:org.apache.hadoop.hive.ql.metadata.HiveException:
> > > java.lang.ClassCastException: org.apache.hadoop.io.IntWritable cannot
> > > be cast to org.apache.hadoop.hive.serde2.io.DateWritable
> > > Time taken: 0.167 seconds
> > > hive>
> > >
> > > --------------------------------------------------------------------------------
> > >
> > > Basically, for the "date" datatype, I am passing an integer value
> > > (the number of days since the Unix epoch, 1 January 1970, chosen so
> > > that the date falls somewhere around 2011). Is this the correct
> > > approach to process date data, or is there some other approach / API
> > > to do it? Could you please let me know your inputs in this regard.
> > >
> > > Thanks,
> > > Ravi
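[Editor's note: the days-since-epoch arithmetic above can be checked with
the JDK alone. A small sketch; the specific date 2011-01-26 is simply the
one that maps to the value 15000 used in the program (java.time is
available from Java 8):]

```java
import java.time.LocalDate;

public class EpochDays {
    // Number of days between the Unix epoch (1970-01-01) and the given date,
    // i.e. the INT value the Avro "date" logical type stores.
    public static int daysSinceEpoch(int year, int month, int day) {
        return (int) LocalDate.of(year, month, day).toEpochDay();
    }

    public static void main(String[] args) {
        System.out.println(daysSinceEpoch(2011, 1, 26));  // prints 15000
    }
}
```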
> > >
> > >
> > >
> > > From: Ryan Blue <rb...@netflix.com.INVALID>
> > > To: Parquet Dev <de...@parquet.apache.org>
> > > Cc: Nagesh R Charka/India/IBM@IBMIN, Srinivas
> > > Mudigonda/India/IBM@IBMIN
> > > Date: 03/09/2016 10:48 PM
> > > Subject: Re: How to write "date, timestamp, decimal" data to
> > > Parquet-files
> > >
> > >
> > >
> > > Hi Ravi,
> > >
> > > Not all of the types are fully implemented yet. I think Hive only has
> > > partial support. If I remember correctly:
> > > * Decimal is supported if the backing primitive type is fixed-length
> > >   binary
> > > * Date and Timestamp are supported, but Time has not been implemented
> > >   yet
> > >
> > > For object models you can build applications on (instead of those
> > > embedded in SQL), only Avro objects can support those types through
> > > its LogicalTypes API. That API has been implemented in parquet-avro,
> > > but not yet committed. I would like for this feature to make it into
> > > 1.9.0. If you want to test in the mean time, check out the pull
> > > request:
> > >
> > > https://github.com/apache/parquet-mr/pull/318
> > >
> > > rb
> > >
> > > --
> > > Ryan Blue
> > > Software Engineer
> > > Netflix
> > >
> > >
> > >
> > >
> >
> >
> > --
> > Ryan Blue
> > Software Engineer
> > Netflix
> >
> >
> >
> >
>
>
> --
> Ryan Blue
> Software Engineer
> Netflix
>
>
>
>
--
Ryan Blue
Software Engineer
Netflix
Re: How to write "date, timestamp, decimal" data to Parquet-files
Posted by Ravi Tatapudi <ra...@in.ibm.com>.
Hello Ryan:
Many thanks for the inputs. I will try to build it today and see how it
goes.
Could you please let me know an approximate date (or month) when
parquet-avro 1.9.0 (or any other parquet-avro 1.8.x release that includes
this fix) will be officially released (for example, by June 2016, by
December 2016, or later)? It would be very helpful for my planning.
Thanks,
Ravi
From: Ryan Blue <rb...@netflix.com.INVALID>
To: Parquet Dev <de...@parquet.apache.org>
Date: 04/04/2016 10:05 PM
Subject: Re: How to write "date, timestamp, decimal" data to
Parquet-files
I don't think you can get the artifacts produced by our CI builds, but you
can check out the branch and build it using instructions in the
repository.
On Mon, Apr 4, 2016 at 5:39 AM, Ravi Tatapudi <ra...@in.ibm.com>
wrote:
> Hello Ryan:
>
> Regarding the support for the "date, timestamp, decimal" data types in
> Parquet files:
>
> In your earlier mail, you mentioned that the pull request
> https://github.com/apache/parquet-mr/pull/318 has the necessary support
> for these data types (and that it would be released as part of
> parquet-avro 1.9.0).
>
> I see that this fix is included in build# 1247 (and above?). How do I get
> this build (or the latest build), i.e. the "parquet-avro" JAR file that
> includes the support for "date", "timestamp", etc.? Could you please let
> me know.
>
> Thanks,
> Ravi
--
Ryan Blue
Software Engineer
Netflix
Re: How to write "date, timestamp, decimal" data to Parquet-files
Posted by Ravi Tatapudi <ra...@in.ibm.com>.
Hello Ryan:
I have downloaded the source via the pull request
https://github.com/apache/parquet-mr/pull/318 (did a fork and downloaded
the source ZIP file) and built it using Maven. The build completed
successfully and produced the file "parquet-avro-1.8.2-SNAPSHOT.jar".
When I tried to verify the "date" data type using this JAR file, I
realized that my existing test programs fail to build against it.
So far, I have built (and run) my test programs using
"parquet-avro-1.6.0.jar". Now, when I try to re-build them using
"parquet-avro-1.8.2-SNAPSHOT.jar", the builds fail. After going through
the source code, I realized that there are many API changes between 1.6.0
and 1.8.2, because of which the sample programs that built with 1.6.0 no
longer compile. (It looks like "AvroParquetWriter" no longer has the
methods "write", "close", etc., but uses some other approach. Do you know
why these methods were removed completely and made incompatible with
parquet-avro 1.6.0?)
Please find below a sample parquet-write program, which now fails to
build with "parquet-avro-1.8.2-SNAPSHOT.jar". Do you have any sample
parquet-write program that works with "parquet-avro-1.8.2.jar" (to write
primitive data types such as "int", "char", etc. to a parquet file, as
shown in the example below)? If yes, could you please point me to it.
=================================================================================================
public static Schema makeSchema() {
    List<Field> fields = new ArrayList<Field>();
    fields.add(new Field("name", Schema.create(Type.STRING), null, null));
    fields.add(new Field("age", Schema.create(Type.INT), null, null));
    fields.add(new Field("dept", Schema.create(Type.STRING), null, null));
    Schema schema = Schema.createRecord("filecc", null, "parquet", false);
    schema.setFields(fields);
    return schema;
}

public static GenericData.Record makeRecord(Schema schema, String name,
        int age, String dept) {
    GenericData.Record record = new GenericData.Record(schema);
    record.put("name", name);
    record.put("age", age);
    record.put("dept", dept);
    return record;
}

public static void main(String[] args) throws IOException,
        InterruptedException, ClassNotFoundException {
    String pqfile = "/tmp/pqtfile1";
    try {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.getLocal(conf);
        Schema schema = makeSchema();
        GenericData.Record rec = makeRecord(schema, "Person A", 21, "ED2");
        AvroParquetWriter writer = new AvroParquetWriter(new Path(pqfile), schema);
        writer.write(rec);
        writer.close();
    } catch (Exception e) {
        e.printStackTrace();
    }
}
=================================================================================================
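[Editor's note: the 1.8.x line of parquet-mr moved the classes from the
"parquet.avro" package to "org.apache.parquet.avro" and favors a builder
for constructing the writer. A sketch of the equivalent setup, assuming
parquet-avro 1.8.x; this was not verified against the snapshot build, so
treat it as an approximation rather than the confirmed API:]

```java
// Sketch only: assumes parquet-avro 1.8.x under org.apache.parquet.avro.
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.avro.AvroParquetWriter;
import org.apache.parquet.hadoop.ParquetWriter;

Schema schema = makeSchema();  // as defined in the program above
GenericData.Record rec = makeRecord(schema, "Person A", 21, "ED2");

// Build the writer instead of calling the old constructor; write/close
// are inherited from ParquetWriter.
ParquetWriter<GenericRecord> writer =
        AvroParquetWriter.<GenericRecord>builder(new Path("/tmp/pqtfile1"))
                .withSchema(schema)
                .build();
writer.write(rec);
writer.close();
```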
Thanks,
Ravi
From: Ravi Tatapudi/India/IBM
To: dev@parquet.apache.org
Date: 04/05/2016 10:53 AM
Subject: Re: How to write "date, timestamp, decimal" data to
Parquet-files
Hello Ryan:
Many thanks for the inputs. I will try to build it today & see how it
goes.
Could you please let me know, any approximate date (or month) as to, when
"parquet-avro-1.9.0 (or any other parquet-avro-1.8.x, that would include
this fix)" would be officially released (for example: by "june 2016" or
"dec 2016" or later) ? It would be very helpful, for my planning.
Thanks,
Ravi
From: Ryan Blue <rb...@netflix.com.INVALID>
To: Parquet Dev <de...@parquet.apache.org>
Date: 04/04/2016 10:05 PM
Subject: Re: How to write "date, timestamp, decimal" data to
Parquet-files
I don't think you can get the artifacts produced by our CI builds, but you
can check out the branch and build it using instructions in the
repository.
On Mon, Apr 4, 2016 at 5:39 AM, Ravi Tatapudi <ra...@in.ibm.com>
wrote:
> Hello Ryan:
>
> Regarding the support for "date, timestamp, decimal" data types for
> Parquet-files:
>
> In your earlier mail, you have mentioned the pull-request-URL:
> https://github.com/apache/parquet-mr/pull/318 has the necessary support
> for these data-types (and that it would be released as part of
> parquet-avro-release:1.9.0).
>
> I see that, this fix is included in build# 1247 (& above?). How to get
> this build (or the latest-build), that includes the JAR-file:
> "parquet-avro" including the support for "date,timestamp"..etc. ? Could
> you please let me know.
>
> Thanks,
> Ravi
>
>
>
> From: Ryan Blue <rb...@netflix.com.INVALID>
> To: Parquet Dev <de...@parquet.apache.org>
> Cc: Nagesh R Charka/India/IBM@IBMIN, Srinivas
> Mudigonda/India/IBM@IBMIN
> Date: 03/14/2016 09:56 PM
> Subject: Re: How to write "date, timestamp, decimal" data to
> Parquet-files
>
>
>
> Ravi,
>
> Support for those types in parquet-avro hasn't been committed yet. It's
> implemented in the branch I pointed you to. If you want to use released
> versions, it should be out in 1.9.0.
>
> rb
>
> On Sun, Mar 13, 2016 at 9:52 PM, Ravi Tatapudi
<ra...@in.ibm.com>
> wrote:
>
> > Hello Ryan:
> >
> > Thanks for the inputs.
> >
> > I am building & running the test-application, primarily using the
> > following JAR-files (for Avro, Parquet-Avro & Hive APIs):
> >
> > 1) avro-1.8.0.jar
> > 2) parquet-avro-1.6.0.jar (This is the latest one, found in the
> > maven-repository-URL:
> > http://mvnrepository.com/artifact/com.twitter/parquet-avro/1.6.0)
> > 3) hive-exec-1.2.1.jar
> >
> > Am I supposed to build/run the test, using a different version of the
> > JAR-files ? Could you please let me know.
> >
> > Thanks,
> > Ravi
> >
> >
> >
> >
> > From: Ryan Blue <rb...@netflix.com.INVALID>
> > To: Parquet Dev <de...@parquet.apache.org>
> > Cc: Nagesh R Charka/India/IBM@IBMIN, Srinivas
> > Mudigonda/India/IBM@IBMIN
> > Date: 03/11/2016 10:54 PM
> > Subject: Re: How to write "date, timestamp, decimal" data to
> > Parquet-files
> >
> >
> >
> > Yes, it is supported in 1.2.1. It went in here:
> >
> >
> >
> >
>
>
https://github.com/apache/hive/commit/912b4897ed457cfc447995b124ae84078287530b
>
> >
> >
> > Are you using a version of Parquet with that pull request in it? Also,
> if
> > you're using CDH this may not work.
> >
> > rb
> >
> > On Fri, Mar 11, 2016 at 12:40 AM, Ravi Tatapudi
> <ra...@in.ibm.com>
> > wrote:
> >
> > > Hello Ryan:
> > >
> > > I am using hive-version: 1.2.1, as indicated below:
> > >
> > > --------------------------------------
> > > $ hive --version
> > > Hive 1.2.1
> > > Subversion git://localhost.localdomain/home/sush/dev/hive.git -r
> > > 243e7c1ac39cb7ac8b65c5bc6988f5cc3162f558
> > > Compiled by sush on Fri Jun 19 02:03:48 PDT 2015
> > > From source with checksum ab480aca41b24a9c3751b8c023338231
> > > $
> > > --------------------------------------
> > >
> > > As I understand, this version of "hive" supports "date" datatype.
> right
> > ?.
> > > Do you want me to re-test using any other higher-version of hive ?
Pl.
> > let
> > > me know your thoughts.
> > >
> > > Thanks,
> > > Ravi
> > >
> > >
> > >
> > > From: Ryan Blue <rb...@netflix.com.INVALID>
> > > To: Parquet Dev <de...@parquet.apache.org>
> > > Cc: Nagesh R Charka/India/IBM@IBMIN, Srinivas
> > > Mudigonda/India/IBM@IBMIN
> > > Date: 03/11/2016 06:18 AM
> > > Subject: Re: How to write "date, timestamp, decimal" data to
> > > Parquet-files
> > >
> > >
> > >
> > > What version of Hive are you using? You should make sure date is
> > supported
> > > there.
> > >
> > > rb
> > >
> > > On Thu, Mar 10, 2016 at 3:11 AM, Ravi Tatapudi
> > <ra...@in.ibm.com>
> > > wrote:
> > >
> > > > Hello Ryan:
> > > >
> > > > Many thanks for the reply. I see that, the text-attachment
> containing
> > my
> > > > test-program is not sent to the mail-group, but got filtered out.
> > Hence,
> > > > copying the program-code below:
> > > >
> > > > =================================================================
> > > > import java.io.IOException;
> > > > import java.util.*;
> > > > import org.apache.hadoop.conf.Configuration;
> > > > import org.apache.hadoop.fs.FileSystem;
> > > > import org.apache.hadoop.fs.Path;
> > > > import org.apache.avro.Schema;
> > > > import org.apache.avro.Schema.Type;
> > > > import org.apache.avro.Schema.Field;
> > > > import org.apache.avro.generic.* ;
> > > > import org.apache.avro.LogicalTypes;
> > > > import org.apache.avro.LogicalTypes.*;
> > > > import org.apache.hadoop.hive.common.type.HiveDecimal;
> > > > import parquet.avro.*;
> > > >
> > > > public class pqtw {
> > > >
> > > > public static Schema makeSchema() {
> > > > List<Field> fields = new ArrayList<Field>();
> > > > fields.add(new Field("name", Schema.create(Type.STRING),
null,
> > > > null));
> > > > fields.add(new Field("age", Schema.create(Type.INT), null,
> > null));
> > > >
> > > > Schema date =
> > > > LogicalTypes.date().addToSchema(Schema.create(Type.INT)) ;
> > > > fields.add(new Field("doj", date, null, null));
> > > >
> > > > Schema schema = Schema.createRecord("filecc", null,
"parquet",
> > > > false);
> > > > schema.setFields(fields);
> > > >
> > > > return(schema);
> > > > }
> > > >
> > > > public static GenericData.Record makeRecord (Schema schema, String
> > name,
> > > > int age, int doj) {
> > > > GenericData.Record record = new GenericData.Record(schema);
> > > > record.put("name", name);
> > > > record.put("age", age);
> > > > record.put("doj", doj);
> > > > return(record);
> > > > }
> > > >
> > > > public static void main(String[] args) throws IOException,
> > > >
> > > > InterruptedException, ClassNotFoundException {
> > > >
> > > > String pqfile = "/tmp/pqtfile1";
> > > >
> > > > try {
> > > >
> > > > Configuration conf = new Configuration();
> > > > FileSystem fs = FileSystem.getLocal(conf);
> > > >
> > > > Schema schema = makeSchema() ;
> > > > GenericData.Record rec = makeRecord(schema,"abcd",
21,15000)
> ;
> > > > AvroParquetWriter writer = new AvroParquetWriter(new
> > > Path(pqfile),
> > > > schema);
> > > > writer.write(rec);
> > > > writer.close();
> > > > }
> > > > catch (Exception e)
> > > > {
> > > > e.printStackTrace();
> > > > }
> > > > }
> > > > }
> > > > =================================================================
> > > >
> > > > With the above logic, I could write the data to parquet-file. However,
> > > > when I load the same into a hive-table & select columns, I could select
> > > > the columns: "name", "age" (i.e., VARCHAR, INT columns) successfully,
> > > > but select of "date" column failed with the error given below:
> > > > --------------------------------------------------------------------------------
> > > > hive> CREATE TABLE PT1 (name varchar(10), age int, doj date) STORED AS PARQUET ;
> > > > OK
> > > > Time taken: 0.369 seconds
> > > > hive> load data local inpath '/tmp/pqtfile1' into table PT1;
> > > > hive> SELECT name,age from PT1;
> > > > OK
> > > > abcd 21
> > > > Time taken: 0.311 seconds, Fetched: 1 row(s)
> > > > hive> SELECT doj from PT1;
> > > > OK
> > > > Failed with exception
> > > > java.io.IOException:org.apache.hadoop.hive.ql.metadata.HiveException:
> > > > java.lang.ClassCastException: org.apache.hadoop.io.IntWritable cannot be
> > > > cast to org.apache.hadoop.hive.serde2.io.DateWritable
> > > > Time taken: 0.167 seconds
> > > > hive>
> > > > --------------------------------------------------------------------------------
> > > >
> > > > Basically, for "date datatype", I am trying to pass an integer-value (for
> > > > the # of days from Unix epoch, 1 January 1970, so that the date falls
> > > > somewhere around 2011..etc). Is this the correct approach to process date
> > > > data (or is there any other approach / API to do it) ? Could you please
> > > > let me know your inputs, in this regard ?
> > > >
> > > > Thanks,
> > > > Ravi
> > > >
> > > >
> > > >
> > > > From: Ryan Blue <rb...@netflix.com.INVALID>
> > > > To: Parquet Dev <de...@parquet.apache.org>
> > > > Cc: Nagesh R Charka/India/IBM@IBMIN, Srinivas
> > > > Mudigonda/India/IBM@IBMIN
> > > > Date: 03/09/2016 10:48 PM
> > > > Subject: Re: How to write "date, timestamp, decimal" data
to
> > > > Parquet-files
> > > >
> > > >
> > > >
> > > > Hi Ravi,
> > > >
> > > > Not all of the types are fully-implemented yet. I think Hive only has
> > > > partial support. If I remember correctly:
> > > > * Decimal is supported if the backing primitive type is fixed-length binary
> > > > * Date and Timestamp are supported, but Time has not been implemented yet
> > > >
> > > > For object models you can build applications on (instead of those embedded
> > > > in SQL), only Avro objects can support those types through its LogicalTypes
> > > > API. That API has been implemented in parquet-avro, but not yet committed.
> > > > I would like for this feature to make it into 1.9.0. If you want to test
> > > > in the mean time, check out the pull request:
> > > >
> > > > https://github.com/apache/parquet-mr/pull/318
> > > >
> > > > rb
> > > >
> > > > On Wed, Mar 9, 2016 at 5:09 AM, Ravi Tatapudi
> > <ra...@in.ibm.com>
> > > > wrote:
> > > >
> > > > > Hello,
> > > > >
> > > > > I am Ravi Tatapudi, from IBM-India. I am working on a simple
> > > test-tool,
> > > > > that writes data to Parquet-files, which can be imported into
> > > > hive-tables.
> > > > > Pl. find attached sample-program, which writes simple
> > > parquet-data-file:
> > > > >
> > > > >
> > > > >
> > > > > Using the above program, I could create "parquet-files" with data-types:
> > > > > INT, LONG, STRING, Boolean...etc (i.e., basically all data-types
> > > > > supported by "org.apache.avro.Schema.Type") & load it into "hive" tables
> > > > > successfully.
> > > > >
> > > > > Now, I am trying to figure out, how to write "date, timestamp, decimal
> > > > > data" into parquet-files. In this context, I request you provide the
> > > > > possible options (and/or sample-program, if any..), in this regard.
> > > > >
> > > > > Thanks,
> > > > > Ravi
> > > > >
> > > >
> > > >
> > > >
> > > > --
> > > > Ryan Blue
> > > > Software Engineer
> > > > Netflix
> > > >
> > > >
> > > >
> > > >
> > >
> > >
> > > --
> > > Ryan Blue
> > > Software Engineer
> > > Netflix
> > >
> > >
> > >
> > >
> >
> >
> > --
> > Ryan Blue
> > Software Engineer
> > Netflix
> >
> >
> >
> >
>
>
> --
> Ryan Blue
> Software Engineer
> Netflix
>
>
>
>
--
Ryan Blue
Software Engineer
Netflix
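[Archive note] The epoch-day encoding discussed in this thread (e.g. the value 15000 passed for "doj") is the number of days since 1970-01-01, which the JDK's java.time API (Java 8+) can compute directly. This is an illustrative sketch, not code from the thread; the class name is ours:

```java
import java.time.LocalDate;

public class EpochDays {
    public static void main(String[] args) {
        // Days since the Unix epoch (1970-01-01), as expected by a
        // "date" logical type backed by INT32.
        long doj = LocalDate.of(2011, 1, 26).toEpochDay();
        System.out.println(doj); // prints 15000

        // And back again: the 15000 used in the thread's test program
        // corresponds to this calendar date.
        System.out.println(LocalDate.ofEpochDay(15000)); // prints 2011-01-26
    }
}
```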
Re: How to write "date, timestamp, decimal" data to Parquet-files
Posted by Ravi Tatapudi <ra...@in.ibm.com>.
Thanks Ryan, for the info.
Regards,
Ravi
From: Ryan Blue <rb...@netflix.com.INVALID>
To: Parquet Dev <de...@parquet.apache.org>
Date: 04/05/2016 09:07 PM
Subject: Re: How to write "date, timestamp, decimal" data to
Parquet-files
Ravi,
The only breaking API changes were the renamed packages between 1.6.0 and
1.7.0. Other changes are binary compatible and we have no plans to
deprecate the API you're using. For the release date, I don't know yet. We
haven't closed out all of the 1.9.0 issues yet.
rb
On Tue, Apr 5, 2016 at 5:35 AM, Ravi Tatapudi <ra...@in.ibm.com>
wrote:
> Hello Ryan:
>
> Regarding my question on compatibility between versions: 1.6.0 & 1.8.2":
>
> My apologies for the confusion caused. After investigating further, I
> realized that, the functionality is now in different JARs. With the
> version: 1.6.0, I only included the JAR-file: "parquet-avro-1.6.0.jar"
> during build & execution of the programs.
>
> Now, I see that, I should include the JARs: parquet-avro-1.8.2.jar,
> parquet-hadoop-1.8.2.jar at build-time & include the JARs:
> parquet-format-2.3.1.jar, parquet-column-1.8.2.jar,
> parquet-common-1.8.2.jar, parquet-encoding-1.8.2.jar, for running the
> programs). After doing that, I could build my old applications
> successfully (of course, I had to change some of the import-statements
> from "import parquet.avro" to "import org.apache.parquet.avro"...etc) &
> run the tests successfully.
>
> So, my outstanding queries are:
>
> 1) I believe, now all my tests are using the "deprecated API" for
> AvroParquetWriter. If you have a sample-program using the latest approach,
> I request you to point me to the same.
> 2) If you are aware of any approximate date (or month) as to, when
> "parquet-avro-1.9.0 (or any other parquet-avro-1.8.x, that would include
> this fix)" would be officially released (for example: by "june 2016" or
> "dec 2016" or later), then I request you to please let me know. It would
> be very helpful, for my planning.
>
> Many thanks for your support & help, in this regard.
>
> Thanks,
> Ravi
>
>
>
> From: Ravi Tatapudi/India/IBM
> To: dev@parquet.apache.org
> Date: 04/05/2016 04:29 PM
> Subject: Re: How to write "date, timestamp, decimal" data to
> Parquet-files
>
>
> Hello Ryan:
>
> I have downloaded the source via the "pull-request-URL:
> https://github.com/apache/parquet-mr/pull/318" (did a "fork" & downloaded
> the source-ZIP-file) & built it using maven. The build completed
> successfully & I got the file: "parquet-avro-1.8.2-SNAPSHOT.jar". When I
> tried to verify the "date" data type using this JAR-file, I realized that
> the existing test-programs are failing to build with this new JAR.
>
> So far, I have my test-programs built (and run) using
> "parquet-avro-1.6.0.jar". Now, when I try to re-build the test-programs
> using "parquet-avro-1.8.2-SNAPSHOT.jar", I see that the builds failed.
> After going through the source-code, I realized that there are many changes
> in the API between "1.6.0" & "1.8.2", because of which the sample-programs
> that built with "1.6.0" are not building now. (It looks like the
> "AvroParquetWriter" no longer has the methods "write", "close"...etc, but
> uses some other approach. Do you know why these methods were removed
> completely & made incompatible with parquet-avro-1.6.0?)
>
> Pl. find below a sample parquet-write program, which is now failing with
> "parquet-avro-1.8.2-snapshot.jar". Do you have any sample
> parquet-write-program that works with "parquet-avro-1.8.2.jar" (to write
> primitive data types such as: "int", "char"..etc, to a parquet-file, as
> shown in the below example) ? If yes, could you please point me to the
> same.
>
>
>
=================================================================================================
> public static Schema makeSchema() {
> List<Field> fields = new ArrayList<Field>();
> fields.add(new Field("name", Schema.create(Type.STRING), null,
> null));
> fields.add(new Field("age", Schema.create(Type.INT), null, null));
> fields.add(new Field("dept", Schema.create(Type.STRING), null,
> null));
>
> Schema schema = Schema.createRecord("filecc", null, "parquet",
> false);
> schema.setFields(fields);
> return(schema);
> }
>
> public static GenericData.Record makeRecord (Schema schema, String name,
> int age, String dept) {
> GenericData.Record record = new GenericData.Record(schema);
> record.put("name", name);
> record.put("age", age);
> record.put("dept", dept);
> return(record);
> }
>
> public static void main(String[] args) throws IOException,
> InterruptedException, ClassNotFoundException {
>
> String pqfile = "/tmp/pqtfile1";
> try {
> Configuration conf = new Configuration();
> FileSystem fs = FileSystem.getLocal(conf);
>
> Schema schema = makeSchema() ;
> GenericData.Record rec = makeRecord(schema,"Person A", 21,"ED2") ;
> AvroParquetWriter writer = new AvroParquetWriter(new Path(pqfile), schema);
> writer.write(rec);
> writer.close() ;
>
> } catch (Exception e) { e.printStackTrace(); }
>
>
=================================================================================================
>
> Thanks,
> Ravi
>
>
>
>
> From: Ravi Tatapudi/India/IBM
> To: dev@parquet.apache.org
> Date: 04/05/2016 10:53 AM
> Subject: Re: How to write "date, timestamp, decimal" data to
> Parquet-files
>
>
> Hello Ryan:
>
> Many thanks for the inputs. I will try to build it today & see how it
> goes.
>
> Could you please let me know, any approximate date (or month) as to, when
> "parquet-avro-1.9.0 (or any other parquet-avro-1.8.x, that would include
> this fix)" would be officially released (for example: by "june 2016" or
> "dec 2016" or later) ? It would be very helpful, for my planning.
>
> Thanks,
> Ravi
>
>
>
>
> From: Ryan Blue <rb...@netflix.com.INVALID>
> To: Parquet Dev <de...@parquet.apache.org>
> Date: 04/04/2016 10:05 PM
> Subject: Re: How to write "date, timestamp, decimal" data to
> Parquet-files
>
>
>
> I don't think you can get the artifacts produced by our CI builds, but you
> can check out the branch and build it using instructions in the repository.
>
> On Mon, Apr 4, 2016 at 5:39 AM, Ravi Tatapudi <ra...@in.ibm.com>
> wrote:
>
> > Hello Ryan:
> >
> > Regarding the support for "date, timestamp, decimal" data types for
> > Parquet-files:
> >
> > In your earlier mail, you have mentioned the pull-request-URL:
> > https://github.com/apache/parquet-mr/pull/318 has the necessary
support
> > for these data-types (and that it would be released as part of
> > parquet-avro-release:1.9.0).
> >
> > I see that, this fix is included in build# 1247 (& above?). How to get
> > this build (or the latest-build), that includes the JAR-file:
> > "parquet-avro" including the support for "date,timestamp"..etc. ? Could
> > you please let me know.
> >
> > Thanks,
> > Ravi
> >
> >
> >
> > From: Ryan Blue <rb...@netflix.com.INVALID>
> > To: Parquet Dev <de...@parquet.apache.org>
> > Cc: Nagesh R Charka/India/IBM@IBMIN, Srinivas
> > Mudigonda/India/IBM@IBMIN
> > Date: 03/14/2016 09:56 PM
> > Subject: Re: How to write "date, timestamp, decimal" data to
> > Parquet-files
> >
> >
> >
> > Ravi,
> >
> > Support for those types in parquet-avro hasn't been committed yet. It's
> > implemented in the branch I pointed you to. If you want to use released
> > versions, it should be out in 1.9.0.
> >
> > rb
> >
> > On Sun, Mar 13, 2016 at 9:52 PM, Ravi Tatapudi
> <ra...@in.ibm.com>
> > wrote:
> >
> > > Hello Ryan:
> > >
> > > Thanks for the inputs.
> > >
> > > I am building & running the test-application, primarily using the
> > > following JAR-files (for Avro, Parquet-Avro & Hive APIs):
> > >
> > > 1) avro-1.8.0.jar
> > > 2) parquet-avro-1.6.0.jar (This is the latest one, found in the
> > > maven-repository-URL:
> > > http://mvnrepository.com/artifact/com.twitter/parquet-avro/1.6.0)
> > > 3) hive-exec-1.2.1.jar
> > >
> > > Am I supposed to build/run the test, using a different version of the
> > > JAR-files ? Could you please let me know.
> > >
> > > Thanks,
> > > Ravi
> > >
> > >
> > >
> > >
> > > From: Ryan Blue <rb...@netflix.com.INVALID>
> > > To: Parquet Dev <de...@parquet.apache.org>
> > > Cc: Nagesh R Charka/India/IBM@IBMIN, Srinivas
> > > Mudigonda/India/IBM@IBMIN
> > > Date: 03/11/2016 10:54 PM
> > > Subject: Re: How to write "date, timestamp, decimal" data to
> > > Parquet-files
> > >
> > >
> > >
> > > Yes, it is supported in 1.2.1. It went in here:
> > >
> > >
> > >
> > >
> >
> >
>
>
https://github.com/apache/hive/commit/912b4897ed457cfc447995b124ae84078287530b
>
> >
> > >
> > >
> > > Are you using a version of Parquet with that pull request in it? Also,
> > > if you're using CDH this may not work.
> > > rb
> > >
> > > On Fri, Mar 11, 2016 at 12:40 AM, Ravi Tatapudi
> > <ra...@in.ibm.com>
> > > wrote:
> > >
> > > > Hello Ryan:
> > > >
> > > > I am using hive-version: 1.2.1, as indicated below:
> > > >
> > > > --------------------------------------
> > > > $ hive --version
> > > > Hive 1.2.1
> > > > Subversion git://localhost.localdomain/home/sush/dev/hive.git -r
> > > > 243e7c1ac39cb7ac8b65c5bc6988f5cc3162f558
> > > > Compiled by sush on Fri Jun 19 02:03:48 PDT 2015
> > > > From source with checksum ab480aca41b24a9c3751b8c023338231
> > > > $
> > > > --------------------------------------
> > > >
> > > > As I understand, this version of "hive" supports "date" datatype,
> > > > right? Do you want me to re-test using any other higher-version of
> > > > hive ? Pl. let me know your thoughts.
> > > >
> > > > Thanks,
> > > > Ravi
> > > >
> > > >
> > > >
> > > > From: Ryan Blue <rb...@netflix.com.INVALID>
> > > > To: Parquet Dev <de...@parquet.apache.org>
> > > > Cc: Nagesh R Charka/India/IBM@IBMIN, Srinivas
> > > > Mudigonda/India/IBM@IBMIN
> > > > Date: 03/11/2016 06:18 AM
> > > > Subject: Re: How to write "date, timestamp, decimal" data
to
> > > > Parquet-files
> > > >
> > > >
> > > >
> > > > What version of Hive are you using? You should make sure date is
> > > > supported there.
> > > >
> > > > rb
> > > >
> > > > On Thu, Mar 10, 2016 at 3:11 AM, Ravi Tatapudi
> > > <ra...@in.ibm.com>
> > > > wrote:
> > > >
> > > > > Hello Ryan:
> > > > >
> > > > > Many thanks for the reply. I see that, the text-attachment containing
> > > > > my test-program is not sent to the mail-group, but got filtered out.
> > > > > Hence, copying the program-code below:
> > > > >
> > > > >
=================================================================
> > > > > import java.io.IOException;
> > > > > import java.util.*;
> > > > > import org.apache.hadoop.conf.Configuration;
> > > > > import org.apache.hadoop.fs.FileSystem;
> > > > > import org.apache.hadoop.fs.Path;
> > > > > import org.apache.avro.Schema;
> > > > > import org.apache.avro.Schema.Type;
> > > > > import org.apache.avro.Schema.Field;
> > > > > import org.apache.avro.generic.* ;
> > > > > import org.apache.avro.LogicalTypes;
> > > > > import org.apache.avro.LogicalTypes.*;
> > > > > import org.apache.hadoop.hive.common.type.HiveDecimal;
> > > > > import parquet.avro.*;
> > > > >
> > > > > public class pqtw {
> > > > >
> > > > > public static Schema makeSchema() {
> > > > > List<Field> fields = new ArrayList<Field>();
> > > > > fields.add(new Field("name", Schema.create(Type.STRING), null, null));
> > > > > fields.add(new Field("age", Schema.create(Type.INT), null, null));
> > > > >
> > > > > Schema date =
> > > > > LogicalTypes.date().addToSchema(Schema.create(Type.INT)) ;
> > > > > fields.add(new Field("doj", date, null, null));
> > > > >
> > > > > Schema schema = Schema.createRecord("filecc", null, "parquet", false);
> > > > > schema.setFields(fields);
> > > > >
> > > > > return(schema);
> > > > > }
> > > > >
> > > > > public static GenericData.Record makeRecord (Schema schema, String name, int age, int doj) {
> > > > > GenericData.Record record = new GenericData.Record(schema);
> > > > > record.put("name", name);
> > > > > record.put("age", age);
> > > > > record.put("doj", doj);
> > > > > return(record);
> > > > > }
> > > > >
> > > > > public static void main(String[] args) throws IOException,
> > > > >
> > > > > InterruptedException, ClassNotFoundException {
> > > > >
> > > > > String pqfile = "/tmp/pqtfile1";
> > > > >
> > > > > try {
> > > > >
> > > > > Configuration conf = new Configuration();
> > > > > FileSystem fs = FileSystem.getLocal(conf);
> > > > >
> > > > > Schema schema = makeSchema() ;
> > > > > GenericData.Record rec = makeRecord(schema,"abcd", 21,15000) ;
> > > > > AvroParquetWriter writer = new AvroParquetWriter(new Path(pqfile), schema);
> > > > > writer.write(rec);
> > > > > writer.close();
> > > > > }
> > > > > catch (Exception e)
> > > > > {
> > > > > e.printStackTrace();
> > > > > }
> > > > > }
> > > > > }
> > > > >
=================================================================
> > > > >
> > > > > With the above logic, I could write the data to parquet-file. However,
> > > > > when I load the same into a hive-table & select columns, I could select
> > > > > the columns: "name", "age" (i.e., VARCHAR, INT columns) successfully,
> > > > > but select of "date" column failed with the error given below:
> > > > > --------------------------------------------------------------------------------
> > > > > hive> CREATE TABLE PT1 (name varchar(10), age int, doj date) STORED AS PARQUET ;
> > > > > OK
> > > > > Time taken: 0.369 seconds
> > > > > hive> load data local inpath '/tmp/pqtfile1' into table PT1;
> > > > > hive> SELECT name,age from PT1;
> > > > > OK
> > > > > abcd 21
> > > > > Time taken: 0.311 seconds, Fetched: 1 row(s)
> > > > > hive> SELECT doj from PT1;
> > > > > OK
> > > > > Failed with exception
> > > > > java.io.IOException:org.apache.hadoop.hive.ql.metadata.HiveException:
> > > > > java.lang.ClassCastException: org.apache.hadoop.io.IntWritable cannot be
> > > > > cast to org.apache.hadoop.hive.serde2.io.DateWritable
> > > > > Time taken: 0.167 seconds
> > > > > hive>
> > > > > --------------------------------------------------------------------------------
> > > > >
> > > > > Basically, for "date datatype", I am trying to pass an integer-value (for
> > > > > the # of days from Unix epoch, 1 January 1970, so that the date falls
> > > > > somewhere around 2011..etc). Is this the correct approach to process date
> > > > > data (or is there any other approach / API to do it) ? Could you please
> > > > > let me know your inputs, in this regard ?
> > > > >
> > > > > Thanks,
> > > > > Ravi
> > > > >
> > > > >
> > > > >
> > > > > From: Ryan Blue <rb...@netflix.com.INVALID>
> > > > > To: Parquet Dev <de...@parquet.apache.org>
> > > > > Cc: Nagesh R Charka/India/IBM@IBMIN, Srinivas
> > > > > Mudigonda/India/IBM@IBMIN
> > > > > Date: 03/09/2016 10:48 PM
> > > > > Subject: Re: How to write "date, timestamp, decimal" data
> to
> > > > > Parquet-files
> > > > >
> > > > >
> > > > >
> > > > > Hi Ravi,
> > > > >
> > > > > Not all of the types are fully-implemented yet. I think Hive only has
> > > > > partial support. If I remember correctly:
> > > > > * Decimal is supported if the backing primitive type is fixed-length binary
> > > > > * Date and Timestamp are supported, but Time has not been implemented yet
> > > > >
> > > > > For object models you can build applications on (instead of those embedded
> > > > > in SQL), only Avro objects can support those types through its LogicalTypes
> > > > > API. That API has been implemented in parquet-avro, but not yet committed.
> > > > > I would like for this feature to make it into 1.9.0. If you want to test
> > > > > in the mean time, check out the pull request:
> > > > >
> > > > > https://github.com/apache/parquet-mr/pull/318
> > > > >
> > > > > rb
> > > > >
> > > > > On Wed, Mar 9, 2016 at 5:09 AM, Ravi Tatapudi
> > > <ra...@in.ibm.com>
> > > > > wrote:
> > > > >
> > > > > > Hello,
> > > > > >
> > > > > > I am Ravi Tatapudi, from IBM-India. I am working on a simple
> > > > test-tool,
> > > > > > that writes data to Parquet-files, which can be imported into
> > > > > hive-tables.
> > > > > > Pl. find attached sample-program, which writes simple
> > > > parquet-data-file:
> > > > > >
> > > > > >
> > > > > >
> > > > > > Using the above program, I could create "parquet-files" with data-types:
> > > > > > INT, LONG, STRING, Boolean...etc (i.e., basically all data-types
> > > > > > supported by "org.apache.avro.Schema.Type") & load it into "hive" tables
> > > > > > successfully.
> > > > > >
> > > > > > Now, I am trying to figure out, how to write "date, timestamp, decimal
> > > > > > data" into parquet-files. In this context, I request you provide the
> > > > > > possible options (and/or sample-program, if any..), in this regard.
> > > > > >
> > > > > > Thanks,
> > > > > > Ravi
> > > > > >
> > > > >
> > > > >
> > > > >
> > > > > --
> > > > > Ryan Blue
> > > > > Software Engineer
> > > > > Netflix
> > > > >
> > > > >
> > > > >
> > > > >
> > > >
> > > >
> > > > --
> > > > Ryan Blue
> > > > Software Engineer
> > > > Netflix
> > > >
> > > >
> > > >
> > > >
> > >
> > >
> > > --
> > > Ryan Blue
> > > Software Engineer
> > > Netflix
> > >
> > >
> > >
> > >
> >
> >
> > --
> > Ryan Blue
> > Software Engineer
> > Netflix
> >
> >
> >
> >
>
>
> --
> Ryan Blue
> Software Engineer
> Netflix
>
>
>
>
--
Ryan Blue
Software Engineer
Netflix
Re: How to write "date, timestamp, decimal" data to Parquet-files
Posted by Ravi Tatapudi <ra...@in.ibm.com>.
Hello Ryan:
Using "parquet-avro-1.8.2" (that I have built from the pull-request#
https://github.com/apache/parquet-mr/pull/318), I have tried creating a
"logical type for date" to write to parquet-file, using the code-block
below:
-----------------------------------------------------------------------
34 LogicalType AvroDate = new LogicalType("AvroDate") ;
35 Schema Pdate = AvroDate.addToSchema(Schema.create(Type.INT)) ;
36 fields.add(new Field("doj", Pdate, null, null));
-----------------------------------------------------------------------
The program built successfully. But when I run the program, I get the
exception below:
----------------------------------------------------------------------------------------------------------------------
Exception in thread "main" java.lang.NoSuchMethodError:
org/apache/avro/Schema.setLogicalType(Lorg/apache/avro/LogicalType;)V
at org.apache.avro.LogicalType.addToSchema(LogicalType.java:72)
at pqtw.makeSchema(pqtw.java:35)
at pqtw.main(pqtw.java:63)
----------------------------------------------------------------------------------------------------------------------
I am using "parquet-avro-1.8.2.jar" & "avro-1.8.0.jar". The error is
indicating that the method: "org/apache/avro/Schema.setLogicalType" is NOT
found. From the trace, it looks like "addToSchema" function is in turn
calling "setLogicalType" in schema class, which is where it is failing
with "NoMethodFound" exception.
Hence, I am trying to understand, whether it is a correct way to create a
"LogicalType" (or) if there is any other approach (or if I should use a
"higher version" of "avro.jar", if any...) ?
Could you please let me know your inputs in this regard.. (or do you
suppose, this question should go to "AVRO-mailing-list" ?) Pl. let me know
your thoughts.
Thanks,
Ravi
NOTE:
Pl. find below the full-code of the test-program, if you wish to have a
look. FYI only.
======================================================================================
1 import java.io.IOException;
2 import java.util.*;
3
4 import org.apache.hadoop.conf.Configuration;
5 import org.apache.hadoop.fs.FileSystem;
6 import org.apache.hadoop.fs.Path;
7 import org.apache.hadoop.io.Text;
8
9 import org.apache.parquet.avro.*;
10
11 import org.apache.avro.Schema;
12 import org.apache.avro.Schema.Type;
13 import org.apache.avro.Schema.Field;
14 import org.apache.avro.LogicalType;
15 import org.apache.avro.LogicalTypes;
16
17 import org.apache.parquet.column.ParquetProperties.WriterVersion;
18
19 import org.apache.parquet.hadoop.api.WriteSupport;
20 import org.apache.parquet.hadoop.ParquetWriter;
21 import org.apache.parquet.hadoop.ParquetWriter.*;
22 import org.apache.parquet.hadoop.metadata.CompressionCodecName;
23
24 import org.apache.avro.generic.* ;
25
26 public class pqtw {
27
28 public static Schema makeSchema() {
29 List<Field> fields = new ArrayList<Field>();
30 fields.add(new Field("name", Schema.create(Type.STRING), null, null));
31 fields.add(new Field("age", Schema.create(Type.INT), null, null));
32 //fields.add(new Field("doj", Schema.create(Type.INT), null, null));
33
34 LogicalType AvroDate = new LogicalType("AvroDate") ;
35 Schema Pdate = AvroDate.addToSchema(Schema.create(Type.INT)) ;
36 fields.add(new Field("doj", Pdate, null, null));
37
38 Schema schema = Schema.createRecord("filecc", null, "parquet", false);
39 schema.setFields(fields);
40
41 return(schema);
42 }
43
44 public static GenericData.Record makeRecord (Schema schema, String name, int age, int doj) {
45 GenericData.Record record = new GenericData.Record(schema);
46 record.put("name", name);
47 record.put("age", age);
48 record.put("doj", doj);
49 return(record);
50 }
51
52 public static void main(String[] args) throws IOException,
53
54 InterruptedException, ClassNotFoundException {
55
56 String pqfile = "/tmp/pqtfile2";
57
58 try {
59
60 Configuration conf = new Configuration();
61 FileSystem fs = FileSystem.getLocal(conf);
62
63 Schema schema = makeSchema() ;
64 GenericData.Record rec = makeRecord(schema,"abcd", 5,15000) ;
65 AvroParquetWriter writer = new AvroParquetWriter(new Path(pqfile), schema) ;
66 writer.write(rec);
67 writer.close();
68 }
69 catch (Exception e)
70 {
71 e.printStackTrace();
72 }
73 }
74 }
======================================================================================
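[Archive note] For comparison with the custom `new LogicalType("AvroDate")` above: Avro 1.8 already ships a built-in "date" logical type via the LogicalTypes factory, so a hand-rolled logical type should not be needed; the NoSuchMethodError reported above typically means an older avro jar (pre-1.8, lacking Schema.setLogicalType) is being picked up on the runtime classpath. A minimal sketch, assuming avro 1.8.x at both compile time and runtime; the class name is ours:

```java
import org.apache.avro.LogicalTypes;
import org.apache.avro.Schema;

public class DateLogicalTypeSketch {
    public static void main(String[] args) {
        // The built-in "date" logical type annotates an INT schema;
        // values are interpreted as days since the Unix epoch.
        Schema dateSchema = LogicalTypes.date()
                .addToSchema(Schema.create(Schema.Type.INT));
        System.out.println(dateSchema.getLogicalType().getName()); // prints date
    }
}
```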
Re: How to write "date, timestamp, decimal" data to Parquet-files
Posted by Ryan Blue <rb...@netflix.com.INVALID>.
Ravi,
The only breaking API changes were the renamed packages between 1.6.0 and
1.7.0. Other changes are binary compatible and we have no plans to
deprecate the API you're using. For the release date, I don't know yet. We
haven't closed out all of the 1.9.0 issues yet.
rb
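[Archive note] The non-deprecated writer entry point asked about in this exchange is the builder added to AvroParquetWriter in the 1.8 line. A minimal sketch of builder-style writing (our own example, not from the thread; assumes parquet-avro 1.8.x and its hadoop dependencies on the classpath):

```java
import org.apache.avro.Schema;
import org.apache.avro.SchemaBuilder;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.avro.AvroParquetWriter;
import org.apache.parquet.hadoop.ParquetWriter;

public class BuilderSketch {
    public static void main(String[] args) throws Exception {
        // Same record shape as the thread's test program.
        Schema schema = SchemaBuilder.record("filecc").namespace("parquet")
                .fields()
                .requiredString("name")
                .requiredInt("age")
                .endRecord();

        GenericRecord rec = new GenericData.Record(schema);
        rec.put("name", "abcd");
        rec.put("age", 21);

        // Builder-style construction replaces the deprecated
        // "new AvroParquetWriter(path, schema)" constructor; write()
        // and close() are inherited from ParquetWriter.
        try (ParquetWriter<GenericRecord> writer =
                AvroParquetWriter.<GenericRecord>builder(new Path("/tmp/pqtfile1"))
                        .withSchema(schema)
                        .build()) {
            writer.write(rec);
        }
    }
}
```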
On Tue, Apr 5, 2016 at 5:35 AM, Ravi Tatapudi <ra...@in.ibm.com>
wrote:
> Hello Ryan:
>
> Regarding my question on compatibility between versions: 1.6.0 & 1.8.2":
>
> My apologies for the confusion caused. After investigating further, I
> realized that, the functionality is now in different JARs. With the
> version: 1.6.0, I only included the JAR-file: "parquet-avro-1.6.0.jar"
> during build & execution of the programs.
>
> Now, I see that, I should include the JARs: parquet-avro-1.8.2.jar,
> parquet-hadoop-1.8.2.jar at build-time & include the JARs:
> parquet-format-2.3.1.jar, parquet-column-1.8.2.jar,
> parquet-common-1.8.2.jar, parquet-encoding-1.8.2.jar, for running the
> programs). After doing that, I could build my old applications
> successfully (of course, I had to change some of the import-statements
> from "import parquet.avro" to "import org.apache.parquet.avro"...etc) &
> run the tests successfully.
>
> So, my outstanding queries are:
>
> 1) I believe, now all my tests are using the "deprecated API" for
> AvroParquetWriter. If you have a sample-program using the latest approach,
> I request you to point me to the same.
> 2) If you are aware of any approximate date (or month) as to, when
> "parquet-avro-1.9.0 (or any other parquet-avro-1.8.x, that would include
> this fix)" would be officially released (for example: by "june 2016" or
> "dec 2016" or later), then I request you to please let me know. It would
> be very helpful, for my planning.
>
> Many thanks for your support & help, in this regard.
>
> Thanks,
> Ravi
>
>
>
> From: Ravi Tatapudi/India/IBM
> To: dev@parquet.apache.org
> Date: 04/05/2016 04:29 PM
> Subject: Re: How to write "date, timestamp, decimal" data to
> Parquet-files
>
>
> Hello Ryan:
>
> I have downloaded the source via the "pull-request-URL:
> https://github.com/apache/parquet-mr/pull/318" (did a "fork" & downloaded
> the source-ZIP-file) & built it using maven. The build completed
> successfully & I got the file: "parquet-avro-1.8.2-SNAPSHOT.jar". When I
> tried to verify the "date" data type using this JAR-file, I realized that
> the existing test-programs are failing to build with this new JAR.
>
> So far, I have my test-programs built (and run) using
> "parquet-avro-1.6.0.jar". Now, when I try to re-build the test-programs
> using "parquet-avro-1.8.2-SNAPSHOT.jar", I see that the builds failed.
> After going through the source-code, I realized that there are many changes
> in the API between "1.6.0" & "1.8.2", because of which the sample-programs
> that built with "1.6.0" are not building now. (It looks like the
> "AvroParquetWriter" no longer has the methods "write", "close"...etc, but
> uses some other approach. Do you know why these methods were removed
> completely & made incompatible with parquet-avro-1.6.0?)
>
> Pl. find below a sample parquet-write program, which is now failing with
> "parquet-avro-1.8.2-snapshot.jar". Do you have any sample
> parquet-write-program that works with "parquet-avro-1.8.2.jar" (to write
> primitive data types such as: "int", "char"..etc, to a parquet-file, as
> shown in the below example) ? If yes, could you please point me to the
> same.
>
>
> =================================================================================================
> public static Schema makeSchema() {
> List<Field> fields = new ArrayList<Field>();
> fields.add(new Field("name", Schema.create(Type.STRING), null,
> null));
> fields.add(new Field("age", Schema.create(Type.INT), null, null));
> fields.add(new Field("dept", Schema.create(Type.STRING), null,
> null));
>
> Schema schema = Schema.createRecord("filecc", null, "parquet",
> false);
> schema.setFields(fields);
> return(schema);
> }
>
> public static GenericData.Record makeRecord (Schema schema, String name,
> int age, String dept) {
> GenericData.Record record = new GenericData.Record(schema);
> record.put("name", name);
> record.put("age", age);
> record.put("dept", dept);
> return(record);
> }
>
> public static void main(String[] args) throws IOException,
> InterruptedException, ClassNotFoundException {
>
> String pqfile = "/tmp/pqtfile1";
> try {
> Configuration conf = new Configuration();
> FileSystem fs = FileSystem.getLocal(conf);
>
> Schema schema = makeSchema() ;
> GenericData.Record rec = makeRecord(schema,"Person A", 21,"ED2") ;
> AvroParquetWriter writer = new AvroParquetWriter(new Path(pqfile),
> schema);
> writer.write(rec);
> writer.close() ;
>
> } catch (Exception e) { e.printStackTrace(); }
>
> =================================================================================================
>
> Thanks,
> Ravi
>
>
>
>
> From: Ravi Tatapudi/India/IBM
> To: dev@parquet.apache.org
> Date: 04/05/2016 10:53 AM
> Subject: Re: How to write "date, timestamp, decimal" data to
> Parquet-files
>
>
> Hello Ryan:
>
> Many thanks for the inputs. I will try to build it today & see how it
> goes.
>
> Could you please let me know, any approximate date (or month) as to, when
> "parquet-avro-1.9.0 (or any other parquet-avro-1.8.x, that would include
> this fix)" would be officially released (for example: by "june 2016" or
> "dec 2016" or later) ? It would be very helpful, for my planning.
>
> Thanks,
> Ravi
>
>
>
>
> From: Ryan Blue <rb...@netflix.com.INVALID>
> To: Parquet Dev <de...@parquet.apache.org>
> Date: 04/04/2016 10:05 PM
> Subject: Re: How to write "date, timestamp, decimal" data to
> Parquet-files
>
>
>
> I don't think you can get the artifacts produced by our CI builds, but you
> can check out the branch and build it using instructions in the
> repository.
>
> On Mon, Apr 4, 2016 at 5:39 AM, Ravi Tatapudi <ra...@in.ibm.com>
> wrote:
>
> > Hello Ryan:
> >
> > Regarding the support for "date, timestamp, decimal" data types for
> > Parquet-files:
> >
> > In your earlier mail, you have mentioned the pull-request-URL:
> > https://github.com/apache/parquet-mr/pull/318 has the necessary support
> > for these data-types (and that it would be released as part of
> > parquet-avro-release:1.9.0).
> >
> > I see that, this fix is included in build# 1247 (& above?). How to get
> > this build (or the latest-build), that includes the JAR-file:
> > "parquet-avro" including the support for "date,timestamp"..etc. ? Could
> > you please let me know.
> >
> > Thanks,
> > Ravi
> >
> >
> >
> > From: Ryan Blue <rb...@netflix.com.INVALID>
> > To: Parquet Dev <de...@parquet.apache.org>
> > Cc: Nagesh R Charka/India/IBM@IBMIN, Srinivas
> > Mudigonda/India/IBM@IBMIN
> > Date: 03/14/2016 09:56 PM
> > Subject: Re: How to write "date, timestamp, decimal" data to
> > Parquet-files
> >
> >
> >
> > Ravi,
> >
> > Support for those types in parquet-avro hasn't been committed yet. It's
> > implemented in the branch I pointed you to. If you want to use released
> > versions, it should be out in 1.9.0.
> >
> > rb
> >
> > On Sun, Mar 13, 2016 at 9:52 PM, Ravi Tatapudi
> <ra...@in.ibm.com>
> > wrote:
> >
> > > Hello Ryan:
> > >
> > > Thanks for the inputs.
> > >
> > > I am building & running the test-application, primarily using the
> > > following JAR-files (for Avro, Parquet-Avro & Hive APIs):
> > >
> > > 1) avro-1.8.0.jar
> > > 2) parquet-avro-1.6.0.jar (This is the latest one, found in the
> > > maven-repository-URL:
> > > http://mvnrepository.com/artifact/com.twitter/parquet-avro/1.6.0)
> > > 3) hive-exec-1.2.1.jar
> > >
> > > Am I supposed to build/run the test, using a different version of the
> > > JAR-files ? Could you please let me know.
> > >
> > > Thanks,
> > > Ravi
> > >
> > >
> > >
> > >
> > > From: Ryan Blue <rb...@netflix.com.INVALID>
> > > To: Parquet Dev <de...@parquet.apache.org>
> > > Cc: Nagesh R Charka/India/IBM@IBMIN, Srinivas
> > > Mudigonda/India/IBM@IBMIN
> > > Date: 03/11/2016 10:54 PM
> > > Subject: Re: How to write "date, timestamp, decimal" data to
> > > Parquet-files
> > >
> > >
> > >
> > > Yes, it is supported in 1.2.1. It went in here:
> > >
> > >
> > >
> > >
> >
> >
>
> https://github.com/apache/hive/commit/912b4897ed457cfc447995b124ae84078287530b
>
> >
> > >
> > >
> > > Are you using a version of Parquet with that pull request in it? Also,
> > if
> > > you're using CDH this may not work.
> > >
> > > rb
> > >
> > > On Fri, Mar 11, 2016 at 12:40 AM, Ravi Tatapudi
> > <ra...@in.ibm.com>
> > > wrote:
> > >
> > > > Hello Ryan:
> > > >
> > > > I am using hive-version: 1.2.1, as indicated below:
> > > >
> > > > --------------------------------------
> > > > $ hive --version
> > > > Hive 1.2.1
> > > > Subversion git://localhost.localdomain/home/sush/dev/hive.git -r
> > > > 243e7c1ac39cb7ac8b65c5bc6988f5cc3162f558
> > > > Compiled by sush on Fri Jun 19 02:03:48 PDT 2015
> > > > From source with checksum ab480aca41b24a9c3751b8c023338231
> > > > $
> > > > --------------------------------------
> > > >
> > > > As I understand, this version of "hive" supports "date" datatype.
> > right
> > > ?.
> > > > Do you want me to re-test using any other higher-version of hive ?
> Pl.
> > > let
> > > > me know your thoughts.
> > > >
> > > > Thanks,
> > > > Ravi
> > > >
> > > >
> > > >
> > > > From: Ryan Blue <rb...@netflix.com.INVALID>
> > > > To: Parquet Dev <de...@parquet.apache.org>
> > > > Cc: Nagesh R Charka/India/IBM@IBMIN, Srinivas
> > > > Mudigonda/India/IBM@IBMIN
> > > > Date: 03/11/2016 06:18 AM
> > > > Subject: Re: How to write "date, timestamp, decimal" data to
> > > > Parquet-files
> > > >
> > > >
> > > >
> > > > What version of Hive are you using? You should make sure date is
> > > supported
> > > > there.
> > > >
> > > > rb
> > > >
> > > > On Thu, Mar 10, 2016 at 3:11 AM, Ravi Tatapudi
> > > <ra...@in.ibm.com>
> > > > wrote:
> > > >
> > > > > Hello Ryan:
> > > > >
> > > > > Many thanks for the reply. I see that, the text-attachment
> > containing
> > > my
> > > > > test-program is not sent to the mail-group, but got filtered out.
> > > Hence,
> > > > > copying the program-code below:
> > > > >
> > > > > =================================================================
> > > > > import java.io.IOException;
> > > > > import java.util.*;
> > > > > import org.apache.hadoop.conf.Configuration;
> > > > > import org.apache.hadoop.fs.FileSystem;
> > > > > import org.apache.hadoop.fs.Path;
> > > > > import org.apache.avro.Schema;
> > > > > import org.apache.avro.Schema.Type;
> > > > > import org.apache.avro.Schema.Field;
> > > > > import org.apache.avro.generic.* ;
> > > > > import org.apache.avro.LogicalTypes;
> > > > > import org.apache.avro.LogicalTypes.*;
> > > > > import org.apache.hadoop.hive.common.type.HiveDecimal;
> > > > > import parquet.avro.*;
> > > > >
> > > > > public class pqtw {
> > > > >
> > > > > public static Schema makeSchema() {
> > > > > List<Field> fields = new ArrayList<Field>();
> > > > > fields.add(new Field("name", Schema.create(Type.STRING),
> null,
> > > > > null));
> > > > > fields.add(new Field("age", Schema.create(Type.INT), null,
> > > null));
> > > > >
> > > > > Schema date =
> > > > > LogicalTypes.date().addToSchema(Schema.create(Type.INT)) ;
> > > > > fields.add(new Field("doj", date, null, null));
> > > > >
> > > > > Schema schema = Schema.createRecord("filecc", null,
> "parquet",
> > > > > false);
> > > > > schema.setFields(fields);
> > > > >
> > > > > return(schema);
> > > > > }
> > > > >
> > > > > public static GenericData.Record makeRecord (Schema schema, String
> > > name,
> > > > > int age, int doj) {
> > > > > GenericData.Record record = new GenericData.Record(schema);
> > > > > record.put("name", name);
> > > > > record.put("age", age);
> > > > > record.put("doj", doj);
> > > > > return(record);
> > > > > }
> > > > >
> > > > > public static void main(String[] args) throws IOException,
> > > > >
> > > > > InterruptedException, ClassNotFoundException {
> > > > >
> > > > > String pqfile = "/tmp/pqtfile1";
> > > > >
> > > > > try {
> > > > >
> > > > > Configuration conf = new Configuration();
> > > > > FileSystem fs = FileSystem.getLocal(conf);
> > > > >
> > > > > Schema schema = makeSchema() ;
> > > > > GenericData.Record rec = makeRecord(schema,"abcd",
> 21,15000)
> > ;
> > > > > AvroParquetWriter writer = new AvroParquetWriter(new
> > > > Path(pqfile),
> > > > > schema);
> > > > > writer.write(rec);
> > > > > writer.close();
> > > > > }
> > > > > catch (Exception e)
> > > > > {
> > > > > e.printStackTrace();
> > > > > }
> > > > > }
> > > > > }
> > > > > =================================================================
> > > > >
> > > > > With the above logic, I could write the data to parquet-file.
> > However,
> > > > > when I load the same into a hive-table & select columns, I could
> > > select
> > > > > the columns: "name", "age" (i.e., VARCHAR, INT columns)
> > successfully,
> > > > but
> > > > > select of "date" column failed with the error given below:
> > > > >
> > > > >
> > > > >
> > > >
> > > >
> > >
> > >
> >
> >
>
> --------------------------------------------------------------------------------
> > > > > hive> CREATE TABLE PT1 (name varchar(10), age int, doj date)
> STORED
> > AS
> > > > > PARQUET ;
> > > > > OK
> > > > > Time taken: 0.369 seconds
> > > > > hive> load data local inpath '/tmp/pqtfile1' into table PT1;
> > > > > hive> SELECT name,age from PT1;
> > > > > OK
> > > > > abcd 21
> > > > > Time taken: 0.311 seconds, Fetched: 1 row(s)
> > > > > hive> SELECT doj from PT1;
> > > > > OK
> > > > > Failed with exception
> > > > >
> > java.io.IOException:org.apache.hadoop.hive.ql.metadata.HiveException:
> > > > > java.lang.ClassCastException: org.apache.hadoop.io.IntWritable
> > cannot
> > > be
> > > > > cast to org.apache.hadoop.hive.serde2.io.DateWritable
> > > > > Time taken: 0.167 seconds
> > > > > hive>
> > > > >
> > > > >
> > > >
> > > >
> > >
> > >
> >
> >
>
> --------------------------------------------------------------------------------
> > > > >
> > > > > Basically, for "date datatype", I am trying to pass an
> integer-value
> > > > (for
> > > > > the # of days from Unix epoch, 1 January 1970, so that the date
> > falls
> > > > > somewhere around 2011..etc). Is this the correct approach to
> process
> > > > date
> > > > > data (or is there any other approach / API to do it) ? Could you
> > > please
> > > > > let me know your inputs, in this regard ?
> > > > >
> > > > > Thanks,
> > > > > Ravi
> > > > >
> > > > >
> > > > >
> > > > > From: Ryan Blue <rb...@netflix.com.INVALID>
> > > > > To: Parquet Dev <de...@parquet.apache.org>
> > > > > Cc: Nagesh R Charka/India/IBM@IBMIN, Srinivas
> > > > > Mudigonda/India/IBM@IBMIN
> > > > > Date: 03/09/2016 10:48 PM
> > > > > Subject: Re: How to write "date, timestamp, decimal" data
> to
> > > > > Parquet-files
> > > > >
> > > > >
> > > > >
> > > > > Hi Ravi,
> > > > >
> > > > > Not all of the types are fully-implemented yet. I think Hive only
> > has
> > > > > partial support. If I remember correctly:
> > > > > * Decimal is supported if the backing primitive type is
> fixed-length
> > > > > binary
> > > > > * Date and Timestamp are supported, but Time has not been
> > implemented
> > > > yet
> > > > >
> > > > > For object models you can build applications on (instead of those
> > > > embedded
> > > > > in SQL), only Avro objects can support those types through its
> > > > > LogicalTypes
> > > > > API. That API has been implemented in parquet-avro, but not yet
> > > > committed.
> > > > > I would like for this feature to make it into 1.9.0. If you want
> to
> > > test
> > > > > in
> > > > > the mean time, check out the pull request:
> > > > >
> > > > > https://github.com/apache/parquet-mr/pull/318
> > > > >
> > > > > rb
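
Regarding the note above that decimal is supported when the backing primitive type is fixed-length binary: that encoding stores the two's-complement unscaled value, big-endian, sign-extended to the fixed size. A plain-JDK sketch of that conversion (the helper name is illustrative, not a Parquet API):

```java
import java.math.BigDecimal;
import java.math.BigInteger;
import java.util.Arrays;

public class DecimalFixed {
    // Encode the unscaled value of a decimal the way the
    // decimal-over-fixed representation expects it: a big-endian
    // two's-complement integer, sign-extended to exactly `size` bytes.
    public static byte[] toFixedBytes(BigDecimal value, int size) {
        byte[] unscaled = value.unscaledValue().toByteArray();
        if (unscaled.length > size) {
            throw new IllegalArgumentException(
                value + " does not fit in " + size + " bytes");
        }
        byte[] out = new byte[size];
        // Pad with the sign byte: 0x00 for non-negative, 0xFF for negative.
        Arrays.fill(out, 0, size - unscaled.length,
                    (byte) (value.signum() < 0 ? 0xFF : 0x00));
        System.arraycopy(unscaled, 0, out, size - unscaled.length, unscaled.length);
        return out;
    }

    public static void main(String[] args) {
        // 123.45 with scale 2 has unscaled value 12345; it round-trips intact.
        byte[] bytes = toFixedBytes(new BigDecimal("123.45"), 4);
        System.out.println(new BigInteger(bytes)); // 12345
    }
}
```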
> > > > >
> > > > > On Wed, Mar 9, 2016 at 5:09 AM, Ravi Tatapudi
> > > <ra...@in.ibm.com>
> > > > > wrote:
> > > > >
> > > > > > Hello,
> > > > > >
> > > > > > I am Ravi Tatapudi, from IBM-India. I am working on a simple
> > > > test-tool,
> > > > > > that writes data to Parquet-files, which can be imported into
> > > > > hive-tables.
> > > > > > Pl. find attached sample-program, which writes simple
> > > > parquet-data-file:
> > > > > >
> > > > > >
> > > > > >
> > > > > > Using the above program, I could create "parquet-files" with
> > > > data-types:
> > > > > > INT, LONG, STRING, Boolean...etc (i.e., basically all data-types
> > > > > supported
> > > > > > by "org.apache.avro.Schema.Type) & load it into "hive" tables
> > > > > > successfully.
> > > > > >
> > > > > > Now, I am trying to figure out, how to write "date, timestamp,
> > > decimal
> > > > > > data" into parquet-files. In this context, I request you
> provide
> > > the
> > > > > > possible options (and/or sample-program, if any..), in this
> > regard.
> > > > > >
> > > > > > Thanks,
> > > > > > Ravi
> > > > > >
> > > > >
> > > > >
> > > > >
> > > > > --
> > > > > Ryan Blue
> > > > > Software Engineer
> > > > > Netflix
> > > > >
> > > > >
> > > > >
> > > > >
> > > >
> > > >
> > > > --
> > > > Ryan Blue
> > > > Software Engineer
> > > > Netflix
> > > >
> > > >
> > > >
> > > >
> > >
> > >
> > > --
> > > Ryan Blue
> > > Software Engineer
> > > Netflix
> > >
> > >
> > >
> > >
> >
> >
> > --
> > Ryan Blue
> > Software Engineer
> > Netflix
> >
> >
> >
> >
>
>
> --
> Ryan Blue
> Software Engineer
> Netflix
>
>
>
>
--
Ryan Blue
Software Engineer
Netflix
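
The date approach discussed in the thread above (an Avro `date` logical type over INT, holding whole days since the Unix epoch) can be sanity-checked with plain JDK code rather than by hand; a small sketch (`java.time` requires Java 8 or later; the value 15000 used in the sample decodes to 26 January 2011):

```java
import java.time.LocalDate;

public class EpochDays {
    // Convert a calendar date to the day count the `date` logical type
    // expects: whole days since the Unix epoch (1970-01-01).
    public static int toEpochDays(LocalDate date) {
        return (int) date.toEpochDay();
    }

    public static void main(String[] args) {
        // The sample program in this thread writes 15000, which decodes to:
        System.out.println(LocalDate.ofEpochDay(15000));            // 2011-01-26
        System.out.println(toEpochDays(LocalDate.of(2011, 1, 26))); // 15000
    }
}
```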
Re: How to write "date, timestamp, decimal" data to Parquet-files
Posted by Ravi Tatapudi <ra...@in.ibm.com>.
Hello Ryan:
Regarding my question on compatibility between versions 1.6.0 and 1.8.2:
My apologies for the confusion caused. After investigating further, I
realized that the functionality is now split across different JARs. With
version 1.6.0, I only included the JAR file "parquet-avro-1.6.0.jar" when
building and running the programs.
Now I see that I should include the JARs parquet-avro-1.8.2.jar and
parquet-hadoop-1.8.2.jar at build time, and additionally the JARs
parquet-format-2.3.1.jar, parquet-column-1.8.2.jar,
parquet-common-1.8.2.jar and parquet-encoding-1.8.2.jar for running the
programs. After doing that, I could build my old applications successfully
(of course, I had to change some of the import statements from "import
parquet.avro" to "import org.apache.parquet.avro", etc.) and run the tests
successfully.
So, my outstanding queries are:
1) I believe all my tests are now using the deprecated API for
AvroParquetWriter. If you have a sample program using the latest approach,
I request you to point me to the same.
2) If you are aware of any approximate date (or month) when
parquet-avro-1.9.0 (or any other parquet-avro 1.8.x release that would
include this fix) would be officially released (for example, by June 2016,
December 2016, or later), please let me know. It would be very helpful for
my planning.
Many thanks for your support & help, in this regard.
Thanks,
Ravi
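
On point 1 above, a sketch of the newer builder style that replaces the deprecated AvroParquetWriter constructors (assuming the parquet-avro 1.8.x `AvroParquetWriter.builder` API and the `org.apache.parquet` package names; not verified against a build here):

```java
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.avro.AvroParquetWriter;
import org.apache.parquet.hadoop.ParquetWriter;

public class BuilderSample {
    // Sketch only: assumes `schema` and `rec` are built as in the
    // samples quoted elsewhere in this thread.
    public static void writeOne(Schema schema, GenericData.Record rec)
            throws java.io.IOException {
        try (ParquetWriter<GenericData.Record> writer =
                 AvroParquetWriter.<GenericData.Record>builder(new Path("/tmp/pqtfile1"))
                     .withSchema(schema)
                     .build()) {
            writer.write(rec);
        } // close() happens implicitly via try-with-resources
    }
}
```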
From: Ravi Tatapudi/India/IBM
To: dev@parquet.apache.org
Date: 04/05/2016 04:29 PM
Subject: Re: How to write "date, timestamp, decimal" data to
Parquet-files
Hello Ryan:
I have downloaded the source via the pull-request URL
https://github.com/apache/parquet-mr/pull/318 (did a "fork" and downloaded
the source ZIP file) and built it using Maven. The build completed
successfully and I got the file "parquet-avro-1.8.2-SNAPSHOT.jar". When I
tried to verify the "date" data type using this JAR file, I realized that
the existing test programs fail to build with this new JAR.
So far, my test programs were built (and run) using
"parquet-avro-1.6.0.jar". Now, when I try to re-build them using
"parquet-avro-1.8.2-SNAPSHOT.jar", I see that the builds fail. After going
through the source code, I realized that there are many changes in the API
between "1.6.0" and "1.8.2", because of which the sample programs that
built with "1.6.0" no longer build. (It looks like "AvroParquetWriter" no
longer exposes the same "write"/"close" entry points directly, but uses
some other approach. Do you know why these were changed so incompatibly
with parquet-avro-1.6.0?)
Please find below a sample parquet-write program, which now fails to build
with "parquet-avro-1.8.2-SNAPSHOT.jar". Do you have any sample
parquet-write program that works with "parquet-avro-1.8.2.jar" (to write
primitive data types such as "int", "char", etc. to a parquet file, as
shown in the example below)? If yes, could you please point me to it.
=================================================================================================
public static Schema makeSchema() {
List<Field> fields = new ArrayList<Field>();
fields.add(new Field("name", Schema.create(Type.STRING), null,
null));
fields.add(new Field("age", Schema.create(Type.INT), null, null));
fields.add(new Field("dept", Schema.create(Type.STRING), null,
null));
Schema schema = Schema.createRecord("filecc", null, "parquet",
false);
schema.setFields(fields);
return(schema);
}
public static GenericData.Record makeRecord (Schema schema, String name,
int age, String dept) {
GenericData.Record record = new GenericData.Record(schema);
record.put("name", name);
record.put("age", age);
record.put("dept", dept);
return(record);
}
public static void main(String[] args) throws IOException,
InterruptedException, ClassNotFoundException {
String pqfile = "/tmp/pqtfile1";
try {
Configuration conf = new Configuration();
FileSystem fs = FileSystem.getLocal(conf);
Schema schema = makeSchema() ;
GenericData.Record rec = makeRecord(schema,"Person A", 21,"ED2") ;
AvroParquetWriter writer = new AvroParquetWriter(new Path(pqfile),
schema);
writer.write(rec);
writer.close() ;
} catch (Exception e) { e.printStackTrace(); }
}
=================================================================================================
Thanks,
Ravi
Re: How to write "date, timestamp, decimal" data to Parquet-files
Posted by Ryan Blue <rb...@netflix.com.INVALID>.
I don't think you can get the artifacts produced by our CI builds, but you
can check out the branch and build it using instructions in the repository.
On Mon, Apr 4, 2016 at 5:39 AM, Ravi Tatapudi <ra...@in.ibm.com>
wrote:
> Hello Ryan:
>
> Regarding the support for "date, timestamp, decimal" data types for
> Parquet-files:
>
> In your earlier mail, you have mentioned the pull-request-URL:
> https://github.com/apache/parquet-mr/pull/318 has the necessary support
> for these data-types (and that it would be released as part of
> parquet-avro-release:1.9.0).
>
> I see that, this fix is included in build# 1247 (& above?). How to get
> this build (or the latest-build), that includes the JAR-file:
> "parquet-avro" including the support for "date,timestamp"..etc. ? Could
> you please let me know.
>
> Thanks,
> Ravi
>
>
>
> From: Ryan Blue <rb...@netflix.com.INVALID>
> To: Parquet Dev <de...@parquet.apache.org>
> Cc: Nagesh R Charka/India/IBM@IBMIN, Srinivas
> Mudigonda/India/IBM@IBMIN
> Date: 03/14/2016 09:56 PM
> Subject: Re: How to write "date, timestamp, decimal" data to
> Parquet-files
>
>
>
> Ravi,
>
> Support for those types in parquet-avro hasn't been committed yet. It's
> implemented in the branch I pointed you to. If you want to use released
> versions, it should be out in 1.9.0.
>
> rb
>
> On Sun, Mar 13, 2016 at 9:52 PM, Ravi Tatapudi <ra...@in.ibm.com>
> wrote:
>
> > Hello Ryan:
> >
> > Thanks for the inputs.
> >
> > I am building & running the test-application, primarily using the
> > following JAR-files (for Avro, Parquet-Avro & Hive APIs):
> >
> > 1) avro-1.8.0.jar
> > 2) parquet-avro-1.6.0.jar (This is the latest one, found in the
> > maven-repository-URL:
> > http://mvnrepository.com/artifact/com.twitter/parquet-avro/1.6.0)
> > 3) hive-exec-1.2.1.jar
> >
> > Am I supposed to build/run the test, using a different version of the
> > JAR-files ? Could you please let me know.
> >
> > Thanks,
> > Ravi
> >
> >
> >
> >
> > From: Ryan Blue <rb...@netflix.com.INVALID>
> > To: Parquet Dev <de...@parquet.apache.org>
> > Cc: Nagesh R Charka/India/IBM@IBMIN, Srinivas
> > Mudigonda/India/IBM@IBMIN
> > Date: 03/11/2016 10:54 PM
> > Subject: Re: How to write "date, timestamp, decimal" data to
> > Parquet-files
> >
> >
> >
> > Yes, it is supported in 1.2.1. It went in here:
> >
> >
> >
> >
>
> https://github.com/apache/hive/commit/912b4897ed457cfc447995b124ae84078287530b
>
> >
> >
> > Are you using a version of Parquet with that pull request in it? Also,
> if
> > you're using CDH this may not work.
> >
> > rb
> >
> > On Fri, Mar 11, 2016 at 12:40 AM, Ravi Tatapudi
> <ra...@in.ibm.com>
> > wrote:
> >
> > > Hello Ryan:
> > >
> > > I am using hive-version: 1.2.1, as indicated below:
> > >
> > > --------------------------------------
> > > $ hive --version
> > > Hive 1.2.1
> > > Subversion git://localhost.localdomain/home/sush/dev/hive.git -r
> > > 243e7c1ac39cb7ac8b65c5bc6988f5cc3162f558
> > > Compiled by sush on Fri Jun 19 02:03:48 PDT 2015
> > > From source with checksum ab480aca41b24a9c3751b8c023338231
> > > $
> > > --------------------------------------
> > >
> > > As I understand, this version of "hive" supports "date" datatype.
> right
> > ?.
> > > Do you want me to re-test using any other higher-version of hive ? Pl.
> > let
> > > me know your thoughts.
> > >
> > > Thanks,
> > > Ravi
> > >
> > >
> > >
> > > From: Ryan Blue <rb...@netflix.com.INVALID>
> > > To: Parquet Dev <de...@parquet.apache.org>
> > > Cc: Nagesh R Charka/India/IBM@IBMIN, Srinivas
> > > Mudigonda/India/IBM@IBMIN
> > > Date: 03/11/2016 06:18 AM
> > > Subject: Re: How to write "date, timestamp, decimal" data to
> > > Parquet-files
> > >
> > >
> > >
> > > What version of Hive are you using? You should make sure date is
> > supported
> > > there.
> > >
> > > rb
> > >
> > > On Thu, Mar 10, 2016 at 3:11 AM, Ravi Tatapudi
> > <ra...@in.ibm.com>
> > > wrote:
> > >
> > > > Hello Ryan:
> > > >
> > > > Many thanks for the reply. I see that, the text-attachment
> containing
> > my
> > > > test-program is not sent to the mail-group, but got filtered out.
> > Hence,
> > > > copying the program-code below:
> > > >
> > > > =================================================================
> > > > import java.io.IOException;
> > > > import java.util.*;
> > > > import org.apache.hadoop.conf.Configuration;
> > > > import org.apache.hadoop.fs.FileSystem;
> > > > import org.apache.hadoop.fs.Path;
> > > > import org.apache.avro.Schema;
> > > > import org.apache.avro.Schema.Type;
> > > > import org.apache.avro.Schema.Field;
> > > > import org.apache.avro.generic.* ;
> > > > import org.apache.avro.LogicalTypes;
> > > > import org.apache.avro.LogicalTypes.*;
> > > > import org.apache.hadoop.hive.common.type.HiveDecimal;
> > > > import parquet.avro.*;
> > > >
> > > > public class pqtw {
> > > >
> > > > public static Schema makeSchema() {
> > > > List<Field> fields = new ArrayList<Field>();
> > > > fields.add(new Field("name", Schema.create(Type.STRING), null,
> > > > null));
> > > > fields.add(new Field("age", Schema.create(Type.INT), null,
> > null));
> > > >
> > > > Schema date =
> > > > LogicalTypes.date().addToSchema(Schema.create(Type.INT)) ;
> > > > fields.add(new Field("doj", date, null, null));
> > > >
> > > > Schema schema = Schema.createRecord("filecc", null, "parquet",
> > > > false);
> > > > schema.setFields(fields);
> > > >
> > > > return(schema);
> > > > }
> > > >
> > > > public static GenericData.Record makeRecord (Schema schema, String
> > name,
> > > > int age, int doj) {
> > > > GenericData.Record record = new GenericData.Record(schema);
> > > > record.put("name", name);
> > > > record.put("age", age);
> > > > record.put("doj", doj);
> > > > return(record);
> > > > }
> > > >
> > > > public static void main(String[] args) throws IOException,
> > > >
> > > > InterruptedException, ClassNotFoundException {
> > > >
> > > > String pqfile = "/tmp/pqtfile1";
> > > >
> > > > try {
> > > >
> > > > Configuration conf = new Configuration();
> > > > FileSystem fs = FileSystem.getLocal(conf);
> > > >
> > > > Schema schema = makeSchema() ;
> > > > GenericData.Record rec = makeRecord(schema,"abcd", 21,15000)
> ;
> > > > AvroParquetWriter writer = new AvroParquetWriter(new
> > > Path(pqfile),
> > > > schema);
> > > > writer.write(rec);
> > > > writer.close();
> > > > }
> > > > catch (Exception e)
> > > > {
> > > > e.printStackTrace();
> > > > }
> > > > }
> > > > }
> > > > =================================================================
> > > >
> > > > With the above logic, I could write the data to parquet-file.
> However,
> > > > when I load the same into a hive-table & select columns, I could
> > select
> > > > the columns: "name", "age" (i.e., VARCHAR, INT columns)
> successfully,
> > > but
> > > > select of "date" column failed with the error given below:
> > > >
> > > >
> > > >
> > >
> > >
> >
> >
>
> --------------------------------------------------------------------------------
> > > > hive> CREATE TABLE PT1 (name varchar(10), age int, doj date) STORED
> AS
> > > > PARQUET ;
> > > > OK
> > > > Time taken: 0.369 seconds
> > > > hive> load data local inpath '/tmp/pqtfile1' into table PT1;
> > > > hive> SELECT name,age from PT1;
> > > > OK
> > > > abcd 21
> > > > Time taken: 0.311 seconds, Fetched: 1 row(s)
> > > > hive> SELECT doj from PT1;
> > > > OK
> > > > Failed with exception
> > > >
> java.io.IOException:org.apache.hadoop.hive.ql.metadata.HiveException:
> > > > java.lang.ClassCastException: org.apache.hadoop.io.IntWritable
> cannot
> > be
> > > > cast to org.apache.hadoop.hive.serde2.io.DateWritable
> > > > Time taken: 0.167 seconds
> > > > hive>
> > > >
> > > >
> > >
> > >
> >
> >
>
> --------------------------------------------------------------------------------
> > > >
> > > > Basically, for "date datatype", I am trying to pass an integer-value
> > > (for
> > > > the # of days from Unix epoch, 1 January 1970, so that the date
> falls
> > > > somewhere around 2011..etc). Is this the correct approach to process
> > > date
> > > > data (or is there any other approach / API to do it) ? Could you
> > please
> > > > let me know your inputs, in this regard ?
> > > >
> > > > Thanks,
> > > > Ravi
> > > >
> > > >
> > > >
> > > > From: Ryan Blue <rb...@netflix.com.INVALID>
> > > > To: Parquet Dev <de...@parquet.apache.org>
> > > > Cc: Nagesh R Charka/India/IBM@IBMIN, Srinivas
> > > > Mudigonda/India/IBM@IBMIN
> > > > Date: 03/09/2016 10:48 PM
> > > > Subject: Re: How to write "date, timestamp, decimal" data to
> > > > Parquet-files
> > > >
> > > >
> > > >
> > > > Hi Ravi,
> > > >
> > > > Not all of the types are fully-implemented yet. I think Hive only
> has
> > > > partial support. If I remember correctly:
> > > > * Decimal is supported if the backing primitive type is fixed-length
> > > > binary
> > > > * Date and Timestamp are supported, but Time has not been
> implemented
> > > yet
> > > >
> > > > For object models you can build applications on (instead of those
> > > embedded
> > > > in SQL), only Avro objects can support those types through its
> > > > LogicalTypes
> > > > API. That API has been implemented in parquet-avro, but not yet
> > > committed.
> > > > I would like for this feature to make it into 1.9.0. If you want to
> > test
> > > > in
> > > > the mean time, check out the pull request:
> > > >
> > > > https://github.com/apache/parquet-mr/pull/318
> > > >
> > > > rb
> > > >
> > > > On Wed, Mar 9, 2016 at 5:09 AM, Ravi Tatapudi
> > <ra...@in.ibm.com>
> > > > wrote:
> > > >
> > > > > Hello,
> > > > >
> > > > > I am Ravi Tatapudi, from IBM-India. I am working on a simple
> > > test-tool,
> > > > > that writes data to Parquet-files, which can be imported into
> > > > hive-tables.
> > > > > Pl. find attached sample-program, which writes simple
> > > parquet-data-file:
> > > > >
> > > > >
> > > > >
> > > > > Using the above program, I could create "parquet-files" with
> > > data-types:
> > > > > INT, LONG, STRING, Boolean...etc (i.e., basically all data-types
> > > > supported
> > > > > by "org.apache.avro.Schema.Type) & load it into "hive" tables
> > > > > successfully.
> > > > >
> > > > > Now, I am trying to figure out, how to write "date, timestamp,
> > decimal
> > > > > data" into parquet-files. In this context, I request you provide
> > the
> > > > > possible options (and/or sample-program, if any..), in this
> regard.
> > > > >
> > > > > Thanks,
> > > > > Ravi
> > > > >
> > > >
> > > >
> > > >
> > > > --
> > > > Ryan Blue
> > > > Software Engineer
> > > > Netflix
> > > >
> > > >
> > > >
> > > >
> > >
> > >
> > > --
> > > Ryan Blue
> > > Software Engineer
> > > Netflix
> > >
> > >
> > >
> > >
> >
> >
> > --
> > Ryan Blue
> > Software Engineer
> > Netflix
> >
> >
> >
> >
>
>
> --
> Ryan Blue
> Software Engineer
> Netflix
>
>
>
>
--
Ryan Blue
Software Engineer
Netflix
Re: How to write "date, timestamp, decimal" data to Parquet-files
Posted by Ravi Tatapudi <ra...@in.ibm.com>.
Hello Ryan:
Regarding the support for "date, timestamp, decimal" data types for
Parquet-files:
In your earlier mail, you mentioned that the pull request
https://github.com/apache/parquet-mr/pull/318 has the necessary support
for these data types (and that it would be released as part of
parquet-avro 1.9.0).
I see that this fix is included in build #1247 (and above?). How can I
get that build (or the latest build) of the "parquet-avro" JAR-file that
includes the support for "date, timestamp", etc.? Could you please let
me know.
Thanks,
Ravi
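[Editor's note: until 1.9.0 is released, one way to pick up the fix is to build the pull-request branch of parquet-mr locally (`mvn install`) and then depend on the snapshot. The coordinates below are an assumption: the groupId changed from `com.twitter` to `org.apache.parquet` with the first Apache release (1.7.0), so the 1.6.0 artifact on mvnrepository uses the pre-Apache coordinates, and the exact snapshot version may differ until 1.9.0 ships.]

```xml
<!-- Assumed coordinates for a locally installed snapshot of the
     pull-request branch; not an officially published artifact. -->
<dependency>
  <groupId>org.apache.parquet</groupId>
  <artifactId>parquet-avro</artifactId>
  <version>1.9.0-SNAPSHOT</version>
</dependency>
```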
From: Ryan Blue <rb...@netflix.com.INVALID>
To: Parquet Dev <de...@parquet.apache.org>
Cc: Nagesh R Charka/India/IBM@IBMIN, Srinivas
Mudigonda/India/IBM@IBMIN
Date: 03/14/2016 09:56 PM
Subject: Re: How to write "date, timestamp, decimal" data to
Parquet-files
Ravi,
Support for those types in parquet-avro hasn't been committed yet. It's
implemented in the branch I pointed you to. If you want to use released
versions, it should be out in 1.9.0.
rb
On Sun, Mar 13, 2016 at 9:52 PM, Ravi Tatapudi <ra...@in.ibm.com>
wrote:
> Hello Ryan:
>
> Thanks for the inputs.
>
> I am building & running the test-application, primarily using the
> following JAR-files (for Avro, Parquet-Avro & Hive APIs):
>
> 1) avro-1.8.0.jar
> 2) parquet-avro-1.6.0.jar (This is the latest one, found in the
> maven-repository-URL:
> http://mvnrepository.com/artifact/com.twitter/parquet-avro/1.6.0)
> 3) hive-exec-1.2.1.jar
>
> Am I supposed to build/run the test, using a different version of the
> JAR-files ? Could you please let me know.
>
> Thanks,
> Ravi
>
>
>
>
> From: Ryan Blue <rb...@netflix.com.INVALID>
> To: Parquet Dev <de...@parquet.apache.org>
> Cc: Nagesh R Charka/India/IBM@IBMIN, Srinivas
> Mudigonda/India/IBM@IBMIN
> Date: 03/11/2016 10:54 PM
> Subject: Re: How to write "date, timestamp, decimal" data to
> Parquet-files
>
>
>
> Yes, it is supported in 1.2.1. It went in here:
>
>
>
>
https://github.com/apache/hive/commit/912b4897ed457cfc447995b124ae84078287530b
>
>
> Are you using a version of Parquet with that pull request in it? Also,
if
> you're using CDH this may not work.
>
> rb
>
> On Fri, Mar 11, 2016 at 12:40 AM, Ravi Tatapudi
<ra...@in.ibm.com>
> wrote:
>
> > Hello Ryan:
> >
> > I am using hive-version: 1.2.1, as indicated below:
> >
> > --------------------------------------
> > $ hive --version
> > Hive 1.2.1
> > Subversion git://localhost.localdomain/home/sush/dev/hive.git -r
> > 243e7c1ac39cb7ac8b65c5bc6988f5cc3162f558
> > Compiled by sush on Fri Jun 19 02:03:48 PDT 2015
> > From source with checksum ab480aca41b24a9c3751b8c023338231
> > $
> > --------------------------------------
> >
> > As I understand, this version of "hive" supports "date" datatype.
right
> ?.
> > Do you want me to re-test using any other higher-version of hive ? Pl.
> let
> > me know your thoughts.
> >
> > Thanks,
> > Ravi
> >
> >
> >
> > From: Ryan Blue <rb...@netflix.com.INVALID>
> > To: Parquet Dev <de...@parquet.apache.org>
> > Cc: Nagesh R Charka/India/IBM@IBMIN, Srinivas
> > Mudigonda/India/IBM@IBMIN
> > Date: 03/11/2016 06:18 AM
> > Subject: Re: How to write "date, timestamp, decimal" data to
> > Parquet-files
> >
> >
> >
> > What version of Hive are you using? You should make sure date is
> supported
> > there.
> >
> > rb
> >
> > On Thu, Mar 10, 2016 at 3:11 AM, Ravi Tatapudi
> <ra...@in.ibm.com>
> > wrote:
> >
> > > Hello Ryan:
> > >
> > > Many thanks for the reply. I see that, the text-attachment
containing
> my
> > > test-program is not sent to the mail-group, but got filtered out.
> Hence,
> > > copying the program-code below:
> > >
> > > =================================================================
> > > import java.io.IOException;
> > > import java.util.*;
> > > import org.apache.hadoop.conf.Configuration;
> > > import org.apache.hadoop.fs.FileSystem;
> > > import org.apache.hadoop.fs.Path;
> > > import org.apache.avro.Schema;
> > > import org.apache.avro.Schema.Type;
> > > import org.apache.avro.Schema.Field;
> > > import org.apache.avro.generic.* ;
> > > import org.apache.avro.LogicalTypes;
> > > import org.apache.avro.LogicalTypes.*;
> > > import org.apache.hadoop.hive.common.type.HiveDecimal;
> > > import parquet.avro.*;
> > >
> > > public class pqtw {
> > >
> > > public static Schema makeSchema() {
> > > List<Field> fields = new ArrayList<Field>();
> > > fields.add(new Field("name", Schema.create(Type.STRING), null,
> > > null));
> > > fields.add(new Field("age", Schema.create(Type.INT), null,
> null));
> > >
> > > Schema date =
> > > LogicalTypes.date().addToSchema(Schema.create(Type.INT)) ;
> > > fields.add(new Field("doj", date, null, null));
> > >
> > > Schema schema = Schema.createRecord("filecc", null, "parquet",
> > > false);
> > > schema.setFields(fields);
> > >
> > > return(schema);
> > > }
> > >
> > > public static GenericData.Record makeRecord (Schema schema, String
> name,
> > > int age, int doj) {
> > > GenericData.Record record = new GenericData.Record(schema);
> > > record.put("name", name);
> > > record.put("age", age);
> > > record.put("doj", doj);
> > > return(record);
> > > }
> > >
> > > public static void main(String[] args) throws IOException,
> > >
> > > InterruptedException, ClassNotFoundException {
> > >
> > > String pqfile = "/tmp/pqtfile1";
> > >
> > > try {
> > >
> > > Configuration conf = new Configuration();
> > > FileSystem fs = FileSystem.getLocal(conf);
> > >
> > > Schema schema = makeSchema() ;
> > > GenericData.Record rec = makeRecord(schema,"abcd", 21,15000)
;
> > > AvroParquetWriter writer = new AvroParquetWriter(new
> > Path(pqfile),
> > > schema);
> > > writer.write(rec);
> > > writer.close();
> > > }
> > > catch (Exception e)
> > > {
> > > e.printStackTrace();
> > > }
> > > }
> > > }
> > > =================================================================
> > >
> > > With the above logic, I could write the data to parquet-file.
However,
> > > when I load the same into a hive-table & select columns, I could
> select
> > > the columns: "name", "age" (i.e., VARCHAR, INT columns)
successfully,
> > but
> > > select of "date" column failed with the error given below:
> > >
> > >
> > >
> >
> >
>
>
--------------------------------------------------------------------------------
> > > hive> CREATE TABLE PT1 (name varchar(10), age int, doj date) STORED
AS
> > > PARQUET ;
> > > OK
> > > Time taken: 0.369 seconds
> > > hive> load data local inpath '/tmp/pqtfile1' into table PT1;
> > > hive> SELECT name,age from PT1;
> > > OK
> > > abcd 21
> > > Time taken: 0.311 seconds, Fetched: 1 row(s)
> > > hive> SELECT doj from PT1;
> > > OK
> > > Failed with exception
> > >
java.io.IOException:org.apache.hadoop.hive.ql.metadata.HiveException:
> > > java.lang.ClassCastException: org.apache.hadoop.io.IntWritable
cannot
> be
> > > cast to org.apache.hadoop.hive.serde2.io.DateWritable
> > > Time taken: 0.167 seconds
> > > hive>
> > >
> > >
> >
> >
>
>
--------------------------------------------------------------------------------
> > >
> > > Basically, for "date datatype", I am trying to pass an integer-value
> > (for
> > > the # of days from Unix epoch, 1 January 1970, so that the date
falls
> > > somewhere around 2011..etc). Is this the correct approach to process
> > date
> > > data (or is there any other approach / API to do it) ? Could you
> please
> > > let me know your inputs, in this regard ?
> > >
> > > Thanks,
> > > Ravi
> > >
> > >
> > >
> > > From: Ryan Blue <rb...@netflix.com.INVALID>
> > > To: Parquet Dev <de...@parquet.apache.org>
> > > Cc: Nagesh R Charka/India/IBM@IBMIN, Srinivas
> > > Mudigonda/India/IBM@IBMIN
> > > Date: 03/09/2016 10:48 PM
> > > Subject: Re: How to write "date, timestamp, decimal" data to
> > > Parquet-files
> > >
> > >
> > >
> > > Hi Ravi,
> > >
> > > Not all of the types are fully-implemented yet. I think Hive only
has
> > > partial support. If I remember correctly:
> > > * Decimal is supported if the backing primitive type is fixed-length
> > > binary
> > > * Date and Timestamp are supported, but Time has not been
implemented
> > yet
> > >
> > > For object models you can build applications on (instead of those
> > embedded
> > > in SQL), only Avro objects can support those types through its
> > > LogicalTypes
> > > API. That API has been implemented in parquet-avro, but not yet
> > committed.
> > > I would like for this feature to make it into 1.9.0. If you want to
> test
> > > in
> > > the mean time, check out the pull request:
> > >
> > > https://github.com/apache/parquet-mr/pull/318
> > >
> > > rb
> > >
> > > On Wed, Mar 9, 2016 at 5:09 AM, Ravi Tatapudi
> <ra...@in.ibm.com>
> > > wrote:
> > >
> > > > Hello,
> > > >
> > > > I am Ravi Tatapudi, from IBM-India. I am working on a simple
> > test-tool,
> > > > that writes data to Parquet-files, which can be imported into
> > > hive-tables.
> > > > Pl. find attached sample-program, which writes simple
> > parquet-data-file:
> > > >
> > > >
> > > >
> > > > Using the above program, I could create "parquet-files" with
> > data-types:
> > > > INT, LONG, STRING, Boolean...etc (i.e., basically all data-types
> > > supported
> > > > by "org.apache.avro.Schema.Type) & load it into "hive" tables
> > > > successfully.
> > > >
> > > > Now, I am trying to figure out, how to write "date, timestamp,
> decimal
> > > > data" into parquet-files. In this context, I request you provide
> the
> > > > possible options (and/or sample-program, if any..), in this
regard.
> > > >
> > > > Thanks,
> > > > Ravi
> > > >
> > >
> > >
> > >
> > > --
> > > Ryan Blue
> > > Software Engineer
> > > Netflix
> > >
> > >
> > >
> > >
> >
> >
> > --
> > Ryan Blue
> > Software Engineer
> > Netflix
> >
> >
> >
> >
>
>
> --
> Ryan Blue
> Software Engineer
> Netflix
>
>
>
>
--
Ryan Blue
Software Engineer
Netflix
Re: How to write "date, timestamp, decimal" data to Parquet-files
Posted by Ryan Blue <rb...@netflix.com.INVALID>.
Ravi,
Support for those types in parquet-avro hasn't been committed yet. It's
implemented in the branch I pointed you to. If you want to use released
versions, it should be out in 1.9.0.
rb
On Sun, Mar 13, 2016 at 9:52 PM, Ravi Tatapudi <ra...@in.ibm.com>
wrote:
> Hello Ryan:
>
> Thanks for the inputs.
>
> I am building & running the test-application, primarily using the
> following JAR-files (for Avro, Parquet-Avro & Hive APIs):
>
> 1) avro-1.8.0.jar
> 2) parquet-avro-1.6.0.jar (This is the latest one, found in the
> maven-repository-URL:
> http://mvnrepository.com/artifact/com.twitter/parquet-avro/1.6.0)
> 3) hive-exec-1.2.1.jar
>
> Am I supposed to build/run the test, using a different version of the
> JAR-files ? Could you please let me know.
>
> Thanks,
> Ravi
>
>
>
>
> From: Ryan Blue <rb...@netflix.com.INVALID>
> To: Parquet Dev <de...@parquet.apache.org>
> Cc: Nagesh R Charka/India/IBM@IBMIN, Srinivas
> Mudigonda/India/IBM@IBMIN
> Date: 03/11/2016 10:54 PM
> Subject: Re: How to write "date, timestamp, decimal" data to
> Parquet-files
>
>
>
> Yes, it is supported in 1.2.1. It went in here:
>
>
>
> https://github.com/apache/hive/commit/912b4897ed457cfc447995b124ae84078287530b
>
>
> Are you using a version of Parquet with that pull request in it? Also, if
> you're using CDH this may not work.
>
> rb
>
> On Fri, Mar 11, 2016 at 12:40 AM, Ravi Tatapudi <ra...@in.ibm.com>
> wrote:
>
> > Hello Ryan:
> >
> > I am using hive-version: 1.2.1, as indicated below:
> >
> > --------------------------------------
> > $ hive --version
> > Hive 1.2.1
> > Subversion git://localhost.localdomain/home/sush/dev/hive.git -r
> > 243e7c1ac39cb7ac8b65c5bc6988f5cc3162f558
> > Compiled by sush on Fri Jun 19 02:03:48 PDT 2015
> > From source with checksum ab480aca41b24a9c3751b8c023338231
> > $
> > --------------------------------------
> >
> > As I understand, this version of "hive" supports "date" datatype. right
> ?.
> > Do you want me to re-test using any other higher-version of hive ? Pl.
> let
> > me know your thoughts.
> >
> > Thanks,
> > Ravi
> >
> >
> >
> > From: Ryan Blue <rb...@netflix.com.INVALID>
> > To: Parquet Dev <de...@parquet.apache.org>
> > Cc: Nagesh R Charka/India/IBM@IBMIN, Srinivas
> > Mudigonda/India/IBM@IBMIN
> > Date: 03/11/2016 06:18 AM
> > Subject: Re: How to write "date, timestamp, decimal" data to
> > Parquet-files
> >
> >
> >
> > What version of Hive are you using? You should make sure date is
> supported
> > there.
> >
> > rb
> >
> > On Thu, Mar 10, 2016 at 3:11 AM, Ravi Tatapudi
> <ra...@in.ibm.com>
> > wrote:
> >
> > > Hello Ryan:
> > >
> > > Many thanks for the reply. I see that, the text-attachment containing
> my
> > > test-program is not sent to the mail-group, but got filtered out.
> Hence,
> > > copying the program-code below:
> > >
> > > =================================================================
> > > import java.io.IOException;
> > > import java.util.*;
> > > import org.apache.hadoop.conf.Configuration;
> > > import org.apache.hadoop.fs.FileSystem;
> > > import org.apache.hadoop.fs.Path;
> > > import org.apache.avro.Schema;
> > > import org.apache.avro.Schema.Type;
> > > import org.apache.avro.Schema.Field;
> > > import org.apache.avro.generic.* ;
> > > import org.apache.avro.LogicalTypes;
> > > import org.apache.avro.LogicalTypes.*;
> > > import org.apache.hadoop.hive.common.type.HiveDecimal;
> > > import parquet.avro.*;
> > >
> > > public class pqtw {
> > >
> > > public static Schema makeSchema() {
> > > List<Field> fields = new ArrayList<Field>();
> > > fields.add(new Field("name", Schema.create(Type.STRING), null,
> > > null));
> > > fields.add(new Field("age", Schema.create(Type.INT), null,
> null));
> > >
> > > Schema date =
> > > LogicalTypes.date().addToSchema(Schema.create(Type.INT)) ;
> > > fields.add(new Field("doj", date, null, null));
> > >
> > > Schema schema = Schema.createRecord("filecc", null, "parquet",
> > > false);
> > > schema.setFields(fields);
> > >
> > > return(schema);
> > > }
> > >
> > > public static GenericData.Record makeRecord (Schema schema, String
> name,
> > > int age, int doj) {
> > > GenericData.Record record = new GenericData.Record(schema);
> > > record.put("name", name);
> > > record.put("age", age);
> > > record.put("doj", doj);
> > > return(record);
> > > }
> > >
> > > public static void main(String[] args) throws IOException,
> > >
> > > InterruptedException, ClassNotFoundException {
> > >
> > > String pqfile = "/tmp/pqtfile1";
> > >
> > > try {
> > >
> > > Configuration conf = new Configuration();
> > > FileSystem fs = FileSystem.getLocal(conf);
> > >
> > > Schema schema = makeSchema() ;
> > > GenericData.Record rec = makeRecord(schema,"abcd", 21,15000) ;
> > > AvroParquetWriter writer = new AvroParquetWriter(new
> > Path(pqfile),
> > > schema);
> > > writer.write(rec);
> > > writer.close();
> > > }
> > > catch (Exception e)
> > > {
> > > e.printStackTrace();
> > > }
> > > }
> > > }
> > > =================================================================
> > >
> > > With the above logic, I could write the data to parquet-file. However,
> > > when I load the same into a hive-table & select columns, I could
> select
> > > the columns: "name", "age" (i.e., VARCHAR, INT columns) successfully,
> > but
> > > select of "date" column failed with the error given below:
> > >
> > >
> > >
> >
> >
>
> --------------------------------------------------------------------------------
> > > hive> CREATE TABLE PT1 (name varchar(10), age int, doj date) STORED AS
> > > PARQUET ;
> > > OK
> > > Time taken: 0.369 seconds
> > > hive> load data local inpath '/tmp/pqtfile1' into table PT1;
> > > hive> SELECT name,age from PT1;
> > > OK
> > > abcd 21
> > > Time taken: 0.311 seconds, Fetched: 1 row(s)
> > > hive> SELECT doj from PT1;
> > > OK
> > > Failed with exception
> > > java.io.IOException:org.apache.hadoop.hive.ql.metadata.HiveException:
> > > java.lang.ClassCastException: org.apache.hadoop.io.IntWritable cannot
> be
> > > cast to org.apache.hadoop.hive.serde2.io.DateWritable
> > > Time taken: 0.167 seconds
> > > hive>
> > >
> > >
> >
> >
>
> --------------------------------------------------------------------------------
> > >
> > > Basically, for "date datatype", I am trying to pass an integer-value
> > (for
> > > the # of days from Unix epoch, 1 January 1970, so that the date falls
> > > somewhere around 2011..etc). Is this the correct approach to process
> > date
> > > data (or is there any other approach / API to do it) ? Could you
> please
> > > let me know your inputs, in this regard ?
> > >
> > > Thanks,
> > > Ravi
> > >
> > >
> > >
> > > From: Ryan Blue <rb...@netflix.com.INVALID>
> > > To: Parquet Dev <de...@parquet.apache.org>
> > > Cc: Nagesh R Charka/India/IBM@IBMIN, Srinivas
> > > Mudigonda/India/IBM@IBMIN
> > > Date: 03/09/2016 10:48 PM
> > > Subject: Re: How to write "date, timestamp, decimal" data to
> > > Parquet-files
> > >
> > >
> > >
> > > Hi Ravi,
> > >
> > > Not all of the types are fully-implemented yet. I think Hive only has
> > > partial support. If I remember correctly:
> > > * Decimal is supported if the backing primitive type is fixed-length
> > > binary
> > > * Date and Timestamp are supported, but Time has not been implemented
> > yet
> > >
> > > For object models you can build applications on (instead of those
> > embedded
> > > in SQL), only Avro objects can support those types through its
> > > LogicalTypes
> > > API. That API has been implemented in parquet-avro, but not yet
> > committed.
> > > I would like for this feature to make it into 1.9.0. If you want to
> test
> > > in
> > > the mean time, check out the pull request:
> > >
> > > https://github.com/apache/parquet-mr/pull/318
> > >
> > > rb
> > >
> > > On Wed, Mar 9, 2016 at 5:09 AM, Ravi Tatapudi
> <ra...@in.ibm.com>
> > > wrote:
> > >
> > > > Hello,
> > > >
> > > > I am Ravi Tatapudi, from IBM-India. I am working on a simple
> > test-tool,
> > > > that writes data to Parquet-files, which can be imported into
> > > hive-tables.
> > > > Pl. find attached sample-program, which writes simple
> > parquet-data-file:
> > > >
> > > >
> > > >
> > > > Using the above program, I could create "parquet-files" with
> > data-types:
> > > > INT, LONG, STRING, Boolean...etc (i.e., basically all data-types
> > > supported
> > > > by "org.apache.avro.Schema.Type) & load it into "hive" tables
> > > > successfully.
> > > >
> > > > Now, I am trying to figure out, how to write "date, timestamp,
> decimal
> > > > data" into parquet-files. In this context, I request you provide
> the
> > > > possible options (and/or sample-program, if any..), in this regard.
> > > >
> > > > Thanks,
> > > > Ravi
> > > >
> > >
> > >
> > >
> > > --
> > > Ryan Blue
> > > Software Engineer
> > > Netflix
> > >
> > >
> > >
> > >
> >
> >
> > --
> > Ryan Blue
> > Software Engineer
> > Netflix
> >
> >
> >
> >
>
>
> --
> Ryan Blue
> Software Engineer
> Netflix
>
>
>
>
--
Ryan Blue
Software Engineer
Netflix
Re: How to write "date, timestamp, decimal" data to Parquet-files
Posted by Ravi Tatapudi <ra...@in.ibm.com>.
Hello Ryan:
Thanks for the inputs.
I am building & running the test-application, primarily using the
following JAR-files (for Avro, Parquet-Avro & Hive APIs):
1) avro-1.8.0.jar
2) parquet-avro-1.6.0.jar (This is the latest one, found in the
maven-repository-URL:
http://mvnrepository.com/artifact/com.twitter/parquet-avro/1.6.0)
3) hive-exec-1.2.1.jar
Am I supposed to build/run the test, using a different version of the
JAR-files ? Could you please let me know.
Thanks,
Ravi
From: Ryan Blue <rb...@netflix.com.INVALID>
To: Parquet Dev <de...@parquet.apache.org>
Cc: Nagesh R Charka/India/IBM@IBMIN, Srinivas
Mudigonda/India/IBM@IBMIN
Date: 03/11/2016 10:54 PM
Subject: Re: How to write "date, timestamp, decimal" data to
Parquet-files
Yes, it is supported in 1.2.1. It went in here:
https://github.com/apache/hive/commit/912b4897ed457cfc447995b124ae84078287530b
Are you using a version of Parquet with that pull request in it? Also, if
you're using CDH this may not work.
rb
On Fri, Mar 11, 2016 at 12:40 AM, Ravi Tatapudi <ra...@in.ibm.com>
wrote:
> Hello Ryan:
>
> I am using hive-version: 1.2.1, as indicated below:
>
> --------------------------------------
> $ hive --version
> Hive 1.2.1
> Subversion git://localhost.localdomain/home/sush/dev/hive.git -r
> 243e7c1ac39cb7ac8b65c5bc6988f5cc3162f558
> Compiled by sush on Fri Jun 19 02:03:48 PDT 2015
> From source with checksum ab480aca41b24a9c3751b8c023338231
> $
> --------------------------------------
>
> As I understand, this version of "hive" supports "date" datatype. right
?.
> Do you want me to re-test using any other higher-version of hive ? Pl.
let
> me know your thoughts.
>
> Thanks,
> Ravi
>
>
>
> From: Ryan Blue <rb...@netflix.com.INVALID>
> To: Parquet Dev <de...@parquet.apache.org>
> Cc: Nagesh R Charka/India/IBM@IBMIN, Srinivas
> Mudigonda/India/IBM@IBMIN
> Date: 03/11/2016 06:18 AM
> Subject: Re: How to write "date, timestamp, decimal" data to
> Parquet-files
>
>
>
> What version of Hive are you using? You should make sure date is
supported
> there.
>
> rb
>
> On Thu, Mar 10, 2016 at 3:11 AM, Ravi Tatapudi
<ra...@in.ibm.com>
> wrote:
>
> > Hello Ryan:
> >
> > Many thanks for the reply. I see that, the text-attachment containing
my
> > test-program is not sent to the mail-group, but got filtered out.
Hence,
> > copying the program-code below:
> >
> > =================================================================
> > import java.io.IOException;
> > import java.util.*;
> > import org.apache.hadoop.conf.Configuration;
> > import org.apache.hadoop.fs.FileSystem;
> > import org.apache.hadoop.fs.Path;
> > import org.apache.avro.Schema;
> > import org.apache.avro.Schema.Type;
> > import org.apache.avro.Schema.Field;
> > import org.apache.avro.generic.* ;
> > import org.apache.avro.LogicalTypes;
> > import org.apache.avro.LogicalTypes.*;
> > import org.apache.hadoop.hive.common.type.HiveDecimal;
> > import parquet.avro.*;
> >
> > public class pqtw {
> >
> > public static Schema makeSchema() {
> > List<Field> fields = new ArrayList<Field>();
> > fields.add(new Field("name", Schema.create(Type.STRING), null,
> > null));
> > fields.add(new Field("age", Schema.create(Type.INT), null,
null));
> >
> > Schema date =
> > LogicalTypes.date().addToSchema(Schema.create(Type.INT)) ;
> > fields.add(new Field("doj", date, null, null));
> >
> > Schema schema = Schema.createRecord("filecc", null, "parquet",
> > false);
> > schema.setFields(fields);
> >
> > return(schema);
> > }
> >
> > public static GenericData.Record makeRecord (Schema schema, String
name,
> > int age, int doj) {
> > GenericData.Record record = new GenericData.Record(schema);
> > record.put("name", name);
> > record.put("age", age);
> > record.put("doj", doj);
> > return(record);
> > }
> >
> > public static void main(String[] args) throws IOException,
> >
> > InterruptedException, ClassNotFoundException {
> >
> > String pqfile = "/tmp/pqtfile1";
> >
> > try {
> >
> > Configuration conf = new Configuration();
> > FileSystem fs = FileSystem.getLocal(conf);
> >
> > Schema schema = makeSchema() ;
> > GenericData.Record rec = makeRecord(schema,"abcd", 21,15000) ;
> > AvroParquetWriter writer = new AvroParquetWriter(new
> Path(pqfile),
> > schema);
> > writer.write(rec);
> > writer.close();
> > }
> > catch (Exception e)
> > {
> > e.printStackTrace();
> > }
> > }
> > }
> > =================================================================
> >
> > With the above logic, I could write the data to parquet-file. However,
> > when I load the same into a hive-table & select columns, I could
select
> > the columns: "name", "age" (i.e., VARCHAR, INT columns) successfully,
> but
> > select of "date" column failed with the error given below:
> >
> >
> >
>
>
--------------------------------------------------------------------------------
> > hive> CREATE TABLE PT1 (name varchar(10), age int, doj date) STORED AS
> > PARQUET ;
> > OK
> > Time taken: 0.369 seconds
> > hive> load data local inpath '/tmp/pqtfile1' into table PT1;
> > hive> SELECT name,age from PT1;
> > OK
> > abcd 21
> > Time taken: 0.311 seconds, Fetched: 1 row(s)
> > hive> SELECT doj from PT1;
> > OK
> > Failed with exception
> > java.io.IOException:org.apache.hadoop.hive.ql.metadata.HiveException:
> > java.lang.ClassCastException: org.apache.hadoop.io.IntWritable cannot
be
> > cast to org.apache.hadoop.hive.serde2.io.DateWritable
> > Time taken: 0.167 seconds
> > hive>
> >
> >
>
>
--------------------------------------------------------------------------------
> >
> > Basically, for "date datatype", I am trying to pass an integer-value
> (for
> > the # of days from Unix epoch, 1 January 1970, so that the date falls
> > somewhere around 2011..etc). Is this the correct approach to process
> date
> > data (or is there any other approach / API to do it) ? Could you
please
> > let me know your inputs, in this regard ?
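[Editor's note: the day-count approach asked about above is indeed what Avro's `date` logical type expects — an int holding days since the Unix epoch. A small JDK-only sketch of computing that value (the sample date is illustrative):]

```java
import java.time.LocalDate;

public class EpochDayDemo {
    public static void main(String[] args) {
        // Avro's `date` logical type stores the number of days since the
        // Unix epoch (1970-01-01) in an int. Deriving it with java.time
        // avoids hand-counting leap years.
        int doj = (int) LocalDate.of(2011, 1, 26).toEpochDay();
        System.out.println(doj);                       // 15000
        // And the reverse: the 15000 used in the test program above
        // decodes to a date in early 2011, as intended.
        System.out.println(LocalDate.ofEpochDay(doj)); // 2011-01-26
    }
}
```

So the write path in the test program is correct; the ClassCastException comes from the reading side, where Hive's Parquet reader must recognize the date annotation rather than treat the column as a plain int.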
> >
> > Thanks,
> > Ravi
> >
> >
> >
> > From: Ryan Blue <rb...@netflix.com.INVALID>
> > To: Parquet Dev <de...@parquet.apache.org>
> > Cc: Nagesh R Charka/India/IBM@IBMIN, Srinivas
> > Mudigonda/India/IBM@IBMIN
> > Date: 03/09/2016 10:48 PM
> > Subject: Re: How to write "date, timestamp, decimal" data to
> > Parquet-files
> >
> >
> >
> > Hi Ravi,
> >
> > Not all of the types are fully-implemented yet. I think Hive only has
> > partial support. If I remember correctly:
> > * Decimal is supported if the backing primitive type is fixed-length
> > binary
> > * Date and Timestamp are supported, but Time has not been implemented
> yet
> >
> > For object models you can build applications on (instead of those
> embedded
> > in SQL), only Avro objects can support those types through its
> > LogicalTypes
> > API. That API has been implemented in parquet-avro, but not yet
> committed.
> > I would like for this feature to make it into 1.9.0. If you want to
test
> > in
> > the mean time, check out the pull request:
> >
> > https://github.com/apache/parquet-mr/pull/318
> >
> > rb
> >
> > On Wed, Mar 9, 2016 at 5:09 AM, Ravi Tatapudi
<ra...@in.ibm.com>
> > wrote:
> >
> > > Hello,
> > >
> > > I am Ravi Tatapudi, from IBM-India. I am working on a simple
> test-tool,
> > > that writes data to Parquet-files, which can be imported into
> > hive-tables.
> > > Pl. find attached sample-program, which writes simple
> parquet-data-file:
> > >
> > >
> > >
> > > Using the above program, I could create "parquet-files" with
> data-types:
> > > INT, LONG, STRING, Boolean...etc (i.e., basically all data-types
> > supported
> > > by "org.apache.avro.Schema.Type) & load it into "hive" tables
> > > successfully.
> > >
> > > Now, I am trying to figure out, how to write "date, timestamp,
decimal
> > > data" into parquet-files. In this context, I request you provide
the
> > > possible options (and/or sample-program, if any..), in this regard.
> > >
> > > Thanks,
> > > Ravi
> > >
> >
> >
> >
> > --
> > Ryan Blue
> > Software Engineer
> > Netflix
> >
> >
> >
> >
>
>
> --
> Ryan Blue
> Software Engineer
> Netflix
>
>
>
>
--
Ryan Blue
Software Engineer
Netflix
Re: How to write "date, timestamp, decimal" data to Parquet-files
Posted by Ryan Blue <rb...@netflix.com.INVALID>.
Yes, it is supported in 1.2.1. It went in here:
https://github.com/apache/hive/commit/912b4897ed457cfc447995b124ae84078287530b
Are you using a version of Parquet with that pull request in it? Also, if
you're using CDH this may not work.
rb
On Fri, Mar 11, 2016 at 12:40 AM, Ravi Tatapudi <ra...@in.ibm.com>
wrote:
> Hello Ryan:
>
> I am using hive-version: 1.2.1, as indicated below:
>
> --------------------------------------
> $ hive --version
> Hive 1.2.1
> Subversion git://localhost.localdomain/home/sush/dev/hive.git -r
> 243e7c1ac39cb7ac8b65c5bc6988f5cc3162f558
> Compiled by sush on Fri Jun 19 02:03:48 PDT 2015
> From source with checksum ab480aca41b24a9c3751b8c023338231
> $
> --------------------------------------
>
> As I understand, this version of "hive" supports "date" datatype. right ?.
> Do you want me to re-test using any other higher-version of hive ? Pl. let
> me know your thoughts.
>
> Thanks,
> Ravi
>
>
>
> From: Ryan Blue <rb...@netflix.com.INVALID>
> To: Parquet Dev <de...@parquet.apache.org>
> Cc: Nagesh R Charka/India/IBM@IBMIN, Srinivas
> Mudigonda/India/IBM@IBMIN
> Date: 03/11/2016 06:18 AM
> Subject: Re: How to write "date, timestamp, decimal" data to
> Parquet-files
>
>
>
> What version of Hive are you using? You should make sure date is supported
> there.
>
> rb
>
> On Thu, Mar 10, 2016 at 3:11 AM, Ravi Tatapudi <ra...@in.ibm.com>
> wrote:
>
> > Hello Ryan:
> >
> > Many thanks for the reply. I see that, the text-attachment containing my
> > test-program is not sent to the mail-group, but got filtered out. Hence,
> > copying the program-code below:
> >
> > =================================================================
> > import java.io.IOException;
> > import java.util.*;
> > import org.apache.hadoop.conf.Configuration;
> > import org.apache.hadoop.fs.FileSystem;
> > import org.apache.hadoop.fs.Path;
> > import org.apache.avro.Schema;
> > import org.apache.avro.Schema.Type;
> > import org.apache.avro.Schema.Field;
> > import org.apache.avro.generic.* ;
> > import org.apache.avro.LogicalTypes;
> > import org.apache.avro.LogicalTypes.*;
> > import org.apache.hadoop.hive.common.type.HiveDecimal;
> > import parquet.avro.*;
> >
> > public class pqtw {
> >
> > public static Schema makeSchema() {
> > List<Field> fields = new ArrayList<Field>();
> > fields.add(new Field("name", Schema.create(Type.STRING), null,
> > null));
> > fields.add(new Field("age", Schema.create(Type.INT), null, null));
> >
> > Schema date =
> > LogicalTypes.date().addToSchema(Schema.create(Type.INT)) ;
> > fields.add(new Field("doj", date, null, null));
> >
> > Schema schema = Schema.createRecord("filecc", null, "parquet",
> > false);
> > schema.setFields(fields);
> >
> > return(schema);
> > }
> >
> > public static GenericData.Record makeRecord (Schema schema, String name,
> > int age, int doj) {
> > GenericData.Record record = new GenericData.Record(schema);
> > record.put("name", name);
> > record.put("age", age);
> > record.put("doj", doj);
> > return(record);
> > }
> >
> > public static void main(String[] args) throws IOException,
> >
> > InterruptedException, ClassNotFoundException {
> >
> > String pqfile = "/tmp/pqtfile1";
> >
> > try {
> >
> > Configuration conf = new Configuration();
> > FileSystem fs = FileSystem.getLocal(conf);
> >
> > Schema schema = makeSchema() ;
> > GenericData.Record rec = makeRecord(schema,"abcd", 21,15000) ;
> > AvroParquetWriter writer = new AvroParquetWriter(new
> Path(pqfile),
> > schema);
> > writer.write(rec);
> > writer.close();
> > }
> > catch (Exception e)
> > {
> > e.printStackTrace();
> > }
> > }
> > }
> > =================================================================
> >
> > With the above logic, I could write the data to parquet-file. However,
> > when I load the same into a hive-table & select columns, I could select
> > the columns: "name", "age" (i.e., VARCHAR, INT columns) successfully,
> but
> > select of "date" column failed with the error given below:
> >
> >
> >
>
> --------------------------------------------------------------------------------
> > hive> CREATE TABLE PT1 (name varchar(10), age int, doj date) STORED AS
> > PARQUET ;
> > OK
> > Time taken: 0.369 seconds
> > hive> load data local inpath '/tmp/pqtfile1' into table PT1;
> > hive> SELECT name,age from PT1;
> > OK
> > abcd 21
> > Time taken: 0.311 seconds, Fetched: 1 row(s)
> > hive> SELECT doj from PT1;
> > OK
> > Failed with exception
> > java.io.IOException:org.apache.hadoop.hive.ql.metadata.HiveException:
> > java.lang.ClassCastException: org.apache.hadoop.io.IntWritable cannot be
> > cast to org.apache.hadoop.hive.serde2.io.DateWritable
> > Time taken: 0.167 seconds
> > hive>
> >
> >
>
> --------------------------------------------------------------------------------
> >
> > Basically, for "date datatype", I am trying to pass an integer-value
> (for
> > the # of days from Unix epoch, 1 January 1970, so that the date falls
> > somewhere around 2011..etc). Is this the correct approach to process
> date
> > data (or is there any other approach / API to do it) ? Could you please
> > let me know your inputs, in this regard ?
> >
> > Thanks,
> > Ravi
> >
> >
> >
> > From: Ryan Blue <rb...@netflix.com.INVALID>
> > To: Parquet Dev <de...@parquet.apache.org>
> > Cc: Nagesh R Charka/India/IBM@IBMIN, Srinivas
> > Mudigonda/India/IBM@IBMIN
> > Date: 03/09/2016 10:48 PM
> > Subject: Re: How to write "date, timestamp, decimal" data to
> > Parquet-files
> >
> >
> >
> > Hi Ravi,
> >
> > Not all of the types are fully-implemented yet. I think Hive only has
> > partial support. If I remember correctly:
> > * Decimal is supported if the backing primitive type is fixed-length
> > binary
> > * Date and Timestamp are supported, but Time has not been implemented
> yet
> >
> > For object models you can build applications on (instead of those
> embedded
> > in SQL), only Avro objects can support those types through its
> > LogicalTypes
> > API. That API has been implemented in parquet-avro, but not yet
> committed.
> > I would like for this feature to make it into 1.9.0. If you want to test
> > in
> > the mean time, check out the pull request:
> >
> > https://github.com/apache/parquet-mr/pull/318
> >
> > rb
> >
> > On Wed, Mar 9, 2016 at 5:09 AM, Ravi Tatapudi <ra...@in.ibm.com>
> > wrote:
> >
> > > Hello,
> > >
> > > I am Ravi Tatapudi, from IBM-India. I am working on a simple
> test-tool,
> > > that writes data to Parquet-files, which can be imported into
> > hive-tables.
> > > Pl. find attached sample-program, which writes simple
> parquet-data-file:
> > >
> > >
> > >
> > > Using the above program, I could create "parquet-files" with
> data-types:
> > > INT, LONG, STRING, Boolean...etc (i.e., basically all data-types
> > supported
> > > by "org.apache.avro.Schema.Type) & load it into "hive" tables
> > > successfully.
> > >
> > > Now, I am trying to figure out, how to write "date, timestamp, decimal
> > > data" into parquet-files. In this context, I request you provide the
> > > possible options (and/or sample-program, if any..), in this regard.
> > >
> > > Thanks,
> > > Ravi
> > >
> >
> >
> >
> > --
> > Ryan Blue
> > Software Engineer
> > Netflix
> >
> >
> >
> >
>
>
> --
> Ryan Blue
> Software Engineer
> Netflix
>
>
>
>
--
Ryan Blue
Software Engineer
Netflix
Re: How to write "date, timestamp, decimal" data to Parquet-files
Posted by Ravi Tatapudi <ra...@in.ibm.com>.
Hello Ryan:
I am using hive-version: 1.2.1, as indicated below:
--------------------------------------
$ hive --version
Hive 1.2.1
Subversion git://localhost.localdomain/home/sush/dev/hive.git -r
243e7c1ac39cb7ac8b65c5bc6988f5cc3162f558
Compiled by sush on Fri Jun 19 02:03:48 PDT 2015
From source with checksum ab480aca41b24a9c3751b8c023338231
$
--------------------------------------
As I understand, this version of Hive supports the date datatype, right? Do you want me to re-test using a higher version of Hive? Please let me know your thoughts.
Thanks,
Ravi
From: Ryan Blue <rb...@netflix.com.INVALID>
To: Parquet Dev <de...@parquet.apache.org>
Cc: Nagesh R Charka/India/IBM@IBMIN, Srinivas
Mudigonda/India/IBM@IBMIN
Date: 03/11/2016 06:18 AM
Subject: Re: How to write "date, timestamp, decimal" data to
Parquet-files
What version of Hive are you using? You should make sure date is supported
there.
rb
On Thu, Mar 10, 2016 at 3:11 AM, Ravi Tatapudi <ra...@in.ibm.com>
wrote:
> Hello Ryan:
>
> Many thanks for the reply. I see that, the text-attachment containing my
> test-program is not sent to the mail-group, but got filtered out. Hence,
> copying the program-code below:
>
> =================================================================
> import java.io.IOException;
> import java.util.*;
> import org.apache.hadoop.conf.Configuration;
> import org.apache.hadoop.fs.FileSystem;
> import org.apache.hadoop.fs.Path;
> import org.apache.avro.Schema;
> import org.apache.avro.Schema.Type;
> import org.apache.avro.Schema.Field;
> import org.apache.avro.generic.* ;
> import org.apache.avro.LogicalTypes;
> import org.apache.avro.LogicalTypes.*;
> import org.apache.hadoop.hive.common.type.HiveDecimal;
> import parquet.avro.*;
>
> public class pqtw {
>
> public static Schema makeSchema() {
> List<Field> fields = new ArrayList<Field>();
> fields.add(new Field("name", Schema.create(Type.STRING), null,
> null));
> fields.add(new Field("age", Schema.create(Type.INT), null, null));
>
> Schema date =
> LogicalTypes.date().addToSchema(Schema.create(Type.INT)) ;
> fields.add(new Field("doj", date, null, null));
>
> Schema schema = Schema.createRecord("filecc", null, "parquet",
> false);
> schema.setFields(fields);
>
> return(schema);
> }
>
> public static GenericData.Record makeRecord (Schema schema, String name,
> int age, int doj) {
> GenericData.Record record = new GenericData.Record(schema);
> record.put("name", name);
> record.put("age", age);
> record.put("doj", doj);
> return(record);
> }
>
> public static void main(String[] args) throws IOException,
>
> InterruptedException, ClassNotFoundException {
>
> String pqfile = "/tmp/pqtfile1";
>
> try {
>
> Configuration conf = new Configuration();
> FileSystem fs = FileSystem.getLocal(conf);
>
> Schema schema = makeSchema() ;
> GenericData.Record rec = makeRecord(schema,"abcd", 21,15000) ;
> AvroParquetWriter writer = new AvroParquetWriter(new
Path(pqfile),
> schema);
> writer.write(rec);
> writer.close();
> }
> catch (Exception e)
> {
> e.printStackTrace();
> }
> }
> }
> =================================================================
>
> With the above logic, I could write the data to parquet-file. However,
> when I load the same into a hive-table & select columns, I could select
> the columns: "name", "age" (i.e., VARCHAR, INT columns) successfully,
but
> select of "date" column failed with the error given below:
>
>
>
--------------------------------------------------------------------------------
> hive> CREATE TABLE PT1 (name varchar(10), age int, doj date) STORED AS
> PARQUET ;
> OK
> Time taken: 0.369 seconds
> hive> load data local inpath '/tmp/pqtfile1' into table PT1;
> hive> SELECT name,age from PT1;
> OK
> abcd 21
> Time taken: 0.311 seconds, Fetched: 1 row(s)
> hive> SELECT doj from PT1;
> OK
> Failed with exception
> java.io.IOException:org.apache.hadoop.hive.ql.metadata.HiveException:
> java.lang.ClassCastException: org.apache.hadoop.io.IntWritable cannot be
> cast to org.apache.hadoop.hive.serde2.io.DateWritable
> Time taken: 0.167 seconds
> hive>
>
>
--------------------------------------------------------------------------------
>
> Basically, for "date datatype", I am trying to pass an integer-value
(for
> the # of days from Unix epoch, 1 January 1970, so that the date falls
> somewhere around 2011..etc). Is this the correct approach to process
date
> data (or is there any other approach / API to do it) ? Could you please
> let me know your inputs, in this regard ?
>
> Thanks,
> Ravi
>
>
>
> From: Ryan Blue <rb...@netflix.com.INVALID>
> To: Parquet Dev <de...@parquet.apache.org>
> Cc: Nagesh R Charka/India/IBM@IBMIN, Srinivas
> Mudigonda/India/IBM@IBMIN
> Date: 03/09/2016 10:48 PM
> Subject: Re: How to write "date, timestamp, decimal" data to
> Parquet-files
>
>
>
> Hi Ravi,
>
> Not all of the types are fully-implemented yet. I think Hive only has
> partial support. If I remember correctly:
> * Decimal is supported if the backing primitive type is fixed-length
> binary
> * Date and Timestamp are supported, but Time has not been implemented
yet
>
> For object models you can build applications on (instead of those
embedded
> in SQL), only Avro objects can support those types through its
> LogicalTypes
> API. That API has been implemented in parquet-avro, but not yet
committed.
> I would like for this feature to make it into 1.9.0. If you want to test
> in
> the mean time, check out the pull request:
>
> https://github.com/apache/parquet-mr/pull/318
>
> rb
>
> On Wed, Mar 9, 2016 at 5:09 AM, Ravi Tatapudi <ra...@in.ibm.com>
> wrote:
>
> > Hello,
> >
> > I am Ravi Tatapudi, from IBM-India. I am working on a simple
test-tool,
> > that writes data to Parquet-files, which can be imported into
> hive-tables.
> > Pl. find attached sample-program, which writes simple
parquet-data-file:
> >
> >
> >
> > Using the above program, I could create "parquet-files" with
data-types:
> > INT, LONG, STRING, Boolean...etc (i.e., basically all data-types
> supported
> > by "org.apache.avro.Schema.Type) & load it into "hive" tables
> > successfully.
> >
> > Now, I am trying to figure out, how to write "date, timestamp, decimal
> > data" into parquet-files. In this context, I request you provide the
> > possible options (and/or sample-program, if any..), in this regard.
> >
> > Thanks,
> > Ravi
> >
>
>
>
> --
> Ryan Blue
> Software Engineer
> Netflix
>
>
>
>
--
Ryan Blue
Software Engineer
Netflix
Re: How to write "date, timestamp, decimal" data to Parquet-files
Posted by Ryan Blue <rb...@netflix.com.INVALID>.
What version of Hive are you using? You should make sure date is supported
there.
rb
On Thu, Mar 10, 2016 at 3:11 AM, Ravi Tatapudi <ra...@in.ibm.com>
wrote:
> Hello Ryan:
>
> Many thanks for the reply. I see that, the text-attachment containing my
> test-program is not sent to the mail-group, but got filtered out. Hence,
> copying the program-code below:
>
> =================================================================
> import java.io.IOException;
> import java.util.*;
> import org.apache.hadoop.conf.Configuration;
> import org.apache.hadoop.fs.FileSystem;
> import org.apache.hadoop.fs.Path;
> import org.apache.avro.Schema;
> import org.apache.avro.Schema.Type;
> import org.apache.avro.Schema.Field;
> import org.apache.avro.generic.* ;
> import org.apache.avro.LogicalTypes;
> import org.apache.avro.LogicalTypes.*;
> import org.apache.hadoop.hive.common.type.HiveDecimal;
> import parquet.avro.*;
>
> public class pqtw {
>
> public static Schema makeSchema() {
> List<Field> fields = new ArrayList<Field>();
> fields.add(new Field("name", Schema.create(Type.STRING), null,
> null));
> fields.add(new Field("age", Schema.create(Type.INT), null, null));
>
> Schema date =
> LogicalTypes.date().addToSchema(Schema.create(Type.INT)) ;
> fields.add(new Field("doj", date, null, null));
>
> Schema schema = Schema.createRecord("filecc", null, "parquet",
> false);
> schema.setFields(fields);
>
> return(schema);
> }
>
> public static GenericData.Record makeRecord (Schema schema, String name,
> int age, int doj) {
> GenericData.Record record = new GenericData.Record(schema);
> record.put("name", name);
> record.put("age", age);
> record.put("doj", doj);
> return(record);
> }
>
> public static void main(String[] args) throws IOException,
>
> InterruptedException, ClassNotFoundException {
>
> String pqfile = "/tmp/pqtfile1";
>
> try {
>
> Configuration conf = new Configuration();
> FileSystem fs = FileSystem.getLocal(conf);
>
> Schema schema = makeSchema() ;
> GenericData.Record rec = makeRecord(schema,"abcd", 21,15000) ;
> AvroParquetWriter writer = new AvroParquetWriter(new Path(pqfile),
> schema);
> writer.write(rec);
> writer.close();
> }
> catch (Exception e)
> {
> e.printStackTrace();
> }
> }
> }
> =================================================================
>
> With the above logic, I could write the data to parquet-file. However,
> when I load the same into a hive-table & select columns, I could select
> the columns: "name", "age" (i.e., VARCHAR, INT columns) successfully, but
> select of "date" column failed with the error given below:
>
>
> --------------------------------------------------------------------------------
> hive> CREATE TABLE PT1 (name varchar(10), age int, doj date) STORED AS
> PARQUET ;
> OK
> Time taken: 0.369 seconds
> hive> load data local inpath '/tmp/pqtfile1' into table PT1;
> hive> SELECT name,age from PT1;
> OK
> abcd 21
> Time taken: 0.311 seconds, Fetched: 1 row(s)
> hive> SELECT doj from PT1;
> OK
> Failed with exception
> java.io.IOException:org.apache.hadoop.hive.ql.metadata.HiveException:
> java.lang.ClassCastException: org.apache.hadoop.io.IntWritable cannot be
> cast to org.apache.hadoop.hive.serde2.io.DateWritable
> Time taken: 0.167 seconds
> hive>
>
> --------------------------------------------------------------------------------
>
> Basically, for "date datatype", I am trying to pass an integer-value (for
> the # of days from Unix epoch, 1 January 1970, so that the date falls
> somewhere around 2011..etc). Is this the correct approach to process date
> data (or is there any other approach / API to do it) ? Could you please
> let me know your inputs, in this regard ?
>
> Thanks,
> Ravi
>
>
>
> From: Ryan Blue <rb...@netflix.com.INVALID>
> To: Parquet Dev <de...@parquet.apache.org>
> Cc: Nagesh R Charka/India/IBM@IBMIN, Srinivas
> Mudigonda/India/IBM@IBMIN
> Date: 03/09/2016 10:48 PM
> Subject: Re: How to write "date, timestamp, decimal" data to
> Parquet-files
>
>
>
> Hi Ravi,
>
> Not all of the types are fully-implemented yet. I think Hive only has
> partial support. If I remember correctly:
> * Decimal is supported if the backing primitive type is fixed-length
> binary
> * Date and Timestamp are supported, but Time has not been implemented yet
>
> For object models you can build applications on (instead of those embedded
> in SQL), only Avro objects can support those types through its
> LogicalTypes
> API. That API has been implemented in parquet-avro, but not yet committed.
> I would like for this feature to make it into 1.9.0. If you want to test
> in
> the mean time, check out the pull request:
>
> https://github.com/apache/parquet-mr/pull/318
>
> rb
>
> On Wed, Mar 9, 2016 at 5:09 AM, Ravi Tatapudi <ra...@in.ibm.com>
> wrote:
>
> > Hello,
> >
> > I am Ravi Tatapudi, from IBM-India. I am working on a simple test-tool,
> > that writes data to Parquet-files, which can be imported into
> hive-tables.
> > Pl. find attached sample-program, which writes simple parquet-data-file:
> >
> >
> >
> > Using the above program, I could create "parquet-files" with data-types:
> > INT, LONG, STRING, Boolean...etc (i.e., basically all data-types
> supported
> > by "org.apache.avro.Schema.Type) & load it into "hive" tables
> > successfully.
> >
> > Now, I am trying to figure out, how to write "date, timestamp, decimal
> > data" into parquet-files. In this context, I request you provide the
> > possible options (and/or sample-program, if any..), in this regard.
> >
> > Thanks,
> > Ravi
> >
>
>
>
> --
> Ryan Blue
> Software Engineer
> Netflix
>
>
>
>
--
Ryan Blue
Software Engineer
Netflix
Re: How to write "date, timestamp, decimal" data to Parquet-files
Posted by Ravi Tatapudi <ra...@in.ibm.com>.
Hello Ryan:
Many thanks for the reply. I see that the text-attachment containing my
test-program was not sent to the mail-group but got filtered out. Hence, I am
copying the program-code below:
=================================================================
import java.io.IOException;
import java.util.*;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.avro.Schema;
import org.apache.avro.Schema.Type;
import org.apache.avro.Schema.Field;
import org.apache.avro.generic.*;
import org.apache.avro.LogicalTypes;
import org.apache.avro.LogicalTypes.*;
import org.apache.hadoop.hive.common.type.HiveDecimal;
import parquet.avro.*;

public class pqtw {

    public static Schema makeSchema() {
        List<Field> fields = new ArrayList<Field>();
        fields.add(new Field("name", Schema.create(Type.STRING), null, null));
        fields.add(new Field("age", Schema.create(Type.INT), null, null));

        Schema date = LogicalTypes.date().addToSchema(Schema.create(Type.INT));
        fields.add(new Field("doj", date, null, null));

        Schema schema = Schema.createRecord("filecc", null, "parquet", false);
        schema.setFields(fields);

        return (schema);
    }

    public static GenericData.Record makeRecord(Schema schema, String name,
            int age, int doj) {
        GenericData.Record record = new GenericData.Record(schema);
        record.put("name", name);
        record.put("age", age);
        record.put("doj", doj);
        return (record);
    }

    public static void main(String[] args) throws IOException,
            InterruptedException, ClassNotFoundException {

        String pqfile = "/tmp/pqtfile1";

        try {
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.getLocal(conf);

            Schema schema = makeSchema();
            GenericData.Record rec = makeRecord(schema, "abcd", 21, 15000);
            AvroParquetWriter writer = new AvroParquetWriter(new Path(pqfile),
                    schema);
            writer.write(rec);
            writer.close();
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}
=================================================================
With the above logic, I could write the data to parquet-file. However,
when I load the same into a hive-table & select columns, I could select
the columns: "name", "age" (i.e., VARCHAR, INT columns) successfully, but
select of "date" column failed with the error given below:
--------------------------------------------------------------------------------
hive> CREATE TABLE PT1 (name varchar(10), age int, doj date) STORED AS
PARQUET ;
OK
Time taken: 0.369 seconds
hive> load data local inpath '/tmp/pqtfile1' into table PT1;
hive> SELECT name,age from PT1;
OK
abcd 21
Time taken: 0.311 seconds, Fetched: 1 row(s)
hive> SELECT doj from PT1;
OK
Failed with exception
java.io.IOException:org.apache.hadoop.hive.ql.metadata.HiveException:
java.lang.ClassCastException: org.apache.hadoop.io.IntWritable cannot be
cast to org.apache.hadoop.hive.serde2.io.DateWritable
Time taken: 0.167 seconds
hive>
--------------------------------------------------------------------------------
Basically, for the date datatype, I am passing an integer value (the number of
days since the Unix epoch, 1 January 1970, chosen so that the date falls
somewhere around 2011). Is this the correct approach to process date data, or
is there another approach / API for it? Could you please let me know your
inputs in this regard?
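The days-since-epoch arithmetic described above can be checked against the JDK's java.time API, which uses the same day-count convention as the Parquet/Avro DATE logical type (a standalone sketch; the class and helper names are illustrative):

```java
import java.time.LocalDate;

public class EpochDays {
    // Convert a calendar date to the day count used by the DATE logical type.
    static int daysSinceEpoch(int year, int month, int day) {
        return (int) LocalDate.of(year, month, day).toEpochDay();
    }

    // Convert the stored int back to a calendar date.
    static LocalDate fromEpochDays(int days) {
        return LocalDate.ofEpochDay(days);
    }

    public static void main(String[] args) {
        // 15000 days after 1970-01-01 is 2011-01-26,
        // matching the value written by makeRecord above.
        System.out.println(fromEpochDays(15000));        // 2011-01-26
        System.out.println(daysSinceEpoch(2011, 1, 26)); // 15000
    }
}
```

So the integer encoding itself is the standard one; the failure seen in Hive is about how the file's schema annotates that integer, not about the value.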
Thanks,
Ravi
From: Ryan Blue <rb...@netflix.com.INVALID>
To: Parquet Dev <de...@parquet.apache.org>
Cc: Nagesh R Charka/India/IBM@IBMIN, Srinivas
Mudigonda/India/IBM@IBMIN
Date: 03/09/2016 10:48 PM
Subject: Re: How to write "date, timestamp, decimal" data to
Parquet-files
Hi Ravi,
Not all of the types are fully-implemented yet. I think Hive only has
partial support. If I remember correctly:
* Decimal is supported if the backing primitive type is fixed-length
binary
* Date and Timestamp are supported, but Time has not been implemented yet
For object models you can build applications on (instead of those embedded
in SQL), only Avro objects can support those types through its
LogicalTypes
API. That API has been implemented in parquet-avro, but not yet committed.
I would like for this feature to make it into 1.9.0. If you want to test in
the meantime, check out the pull request:
https://github.com/apache/parquet-mr/pull/318
rb
On Wed, Mar 9, 2016 at 5:09 AM, Ravi Tatapudi <ra...@in.ibm.com>
wrote:
> Hello,
>
> I am Ravi Tatapudi, from IBM-India. I am working on a simple test-tool,
> that writes data to Parquet-files, which can be imported into
hive-tables.
> Pl. find attached sample-program, which writes simple parquet-data-file:
>
>
>
> Using the above program, I could create "parquet-files" with data-types:
> INT, LONG, STRING, Boolean...etc (i.e., basically all data-types
supported
> by "org.apache.avro.Schema.Type) & load it into "hive" tables
> successfully.
>
> Now, I am trying to figure out, how to write "date, timestamp, decimal
> data" into parquet-files. In this context, I request you provide the
> possible options (and/or sample-program, if any..), in this regard.
>
> Thanks,
> Ravi
>
--
Ryan Blue
Software Engineer
Netflix
Re: How to write "date, timestamp, decimal" data to Parquet-files
Posted by Ryan Blue <rb...@netflix.com.INVALID>.
Hi Ravi,
Not all of the types are fully-implemented yet. I think Hive only has
partial support. If I remember correctly:
* Decimal is supported if the backing primitive type is fixed-length binary
* Date and Timestamp are supported, but Time has not been implemented yet
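As a side note on the decimal point above: the DECIMAL logical type stores the unscaled value as a big-endian two's-complement integer, which maps directly onto Java's BigDecimal. A standalone sketch of just the byte encoding (class and method names illustrative; writer plumbing omitted):

```java
import java.math.BigDecimal;
import java.math.BigInteger;

public class DecimalBytes {
    // Encode a decimal the way the DECIMAL logical type expects: the unscaled
    // value as big-endian two's-complement bytes; scale and precision are
    // carried by the schema, not the data.
    static byte[] encode(BigDecimal value, int scale) {
        return value.setScale(scale).unscaledValue().toByteArray();
    }

    // Decode back, given the scale declared in the schema.
    static BigDecimal decode(byte[] bytes, int scale) {
        return new BigDecimal(new BigInteger(bytes), scale);
    }

    public static void main(String[] args) {
        byte[] b = encode(new BigDecimal("123.45"), 2); // unscaled value 12345
        System.out.println(decode(b, 2));               // 123.45
    }
}
```

If the backing primitive is fixed-length binary (the case Hive supports), the bytes additionally need sign-extension padding up to the fixed size declared in the schema.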
For object models you can build applications on (instead of those embedded
in SQL), only Avro objects can support those types through its LogicalTypes
API. That API has been implemented in parquet-avro, but not yet committed.
I would like for this feature to make it into 1.9.0. If you want to test in
the meantime, check out the pull request:
https://github.com/apache/parquet-mr/pull/318
rb
On Wed, Mar 9, 2016 at 5:09 AM, Ravi Tatapudi <ra...@in.ibm.com>
wrote:
> Hello,
>
> I am Ravi Tatapudi, from IBM-India. I am working on a simple test-tool,
> that writes data to Parquet-files, which can be imported into hive-tables.
> Pl. find attached sample-program, which writes simple parquet-data-file:
>
>
>
> Using the above program, I could create "parquet-files" with data-types:
> INT, LONG, STRING, Boolean...etc (i.e., basically all data-types supported
> by "org.apache.avro.Schema.Type) & load it into "hive" tables
> successfully.
>
> Now, I am trying to figure out, how to write "date, timestamp, decimal
> data" into parquet-files. In this context, I request you provide the
> possible options (and/or sample-program, if any..), in this regard.
>
> Thanks,
> Ravi
>
--
Ryan Blue
Software Engineer
Netflix
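For reference, the LogicalTypes.date().addToSchema(...) calls discussed in this thread are equivalent to declaring the logical types directly in an Avro JSON schema. A sketch covering the three types asked about (field names are illustrative; the decimal is backed by a fixed-length binary here since, per the discussion above, that is the backing type Hive supports):

```json
{
  "type": "record",
  "name": "example",
  "fields": [
    {"name": "doj", "type": {"type": "int", "logicalType": "date"}},
    {"name": "created_at", "type": {"type": "long", "logicalType": "timestamp-millis"}},
    {"name": "salary", "type": {"type": "fixed", "name": "salary_dec", "size": 5,
                                "logicalType": "decimal", "precision": 10, "scale": 2}}
  ]
}
```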