Posted to dev@avro.apache.org by "Almeida, Julius" <Ju...@intuit.com.INVALID> on 2020/10/16 18:25:49 UTC

Store timestamp with Logical Types

Hi Team,

I wish to store a timestamp (e.g. 2020-10-13 13:01:05) in Parquet using an Avro schema, to be loaded into Hive.
I don't see any option to do so, so I pass it as a string, but Hive fails to read it because it expects a timestamp.
I am using ParquetIO from Apache Beam to write the output files in Parquet format.

Avro schema:
{
  "type": "record",
  "name": "TestRecord",
  "namespace": "com.test-v2.avro",
  "fields": [
    { "name": "reason", "type": "string" },
    { "name": "event", "type": "string" },
    { "name": "timestamp", "type": "string" }
  ]
}
Output event:
{
  "reason": "Invalid",
  "event": "dropped",
  "timestamp": "2020-10-13 13:01:05"
}
Hive schema:
reason -> varchar(100),
event -> varchar(100),
timestamp -> timestamp


But when querying Hive, reading the timestamp column fails because the input event stores it as a string.

I see this Jira in place https://issues.apache.org/jira/browse/AVRO-2924

And I believe the code change can be done here: https://github.com/apache/avro/blob/master/lang/java/avro/src/main/java/org/apache/avro/LogicalTypes.java

Correct me if I am wrong.

Thanks,
Julius

Re: Store timestamp with Logical Types

Posted by Jeremy Custenborder <jc...@gmail.com>.
Timestamps in Avro are represented as a long. Your best bet is to
convert the string to that long representation, for example
milliseconds since the epoch, and then write the Avro; there is a
logical type (timestamp-millis) for this, and Hive will automatically
read it as a timestamp. If you already have the data in a Hive table
as a string, you can run a query that parses the string into an actual
timestamp and then writes to Parquet. Long story short: I would change
whatever is writing the Avro file to use the proper timestamp
encoding.
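Concretely, the "timestamp" field from the schema in the original message could be declared as a long carrying the timestamp-millis logical type:

```json
{
  "name": "timestamp",
  "type": { "type": "long", "logicalType": "timestamp-millis" }
}
```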

https://avro.apache.org/docs/current/spec.html#Timestamp+%28millisecond+precision%29
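For illustration, the string-to-long conversion might look like the following Java sketch; the "yyyy-MM-dd HH:mm:ss" pattern and the UTC zone are assumptions about the input data:

```java
import java.time.LocalDateTime;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;

public class TimestampConvert {
    // Parse a "yyyy-MM-dd HH:mm:ss" string (assumed to be UTC) into
    // milliseconds since the epoch -- the long value that a field with
    // the timestamp-millis logical type expects.
    static long toEpochMillis(String ts) {
        DateTimeFormatter fmt = DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss");
        return LocalDateTime.parse(ts, fmt).toInstant(ZoneOffset.UTC).toEpochMilli();
    }

    public static void main(String[] args) {
        System.out.println(toEpochMillis("2020-10-13 13:01:05")); // prints 1602594065000
    }
}
```

That long can then be written into the record in place of the string value.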
